This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.
No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.
This report responds to the requests of a panel of politicians and managers at the Food Standards Agency, undertaking the specific analyses they requested. It aims to:
Visualize the distributions across local authorities of the % of enforcement actions successfully achieved.
Examine whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to enforcement actions by:
Determining whether there is a relationship between the proportion of successful responses and the number of FTE food safety employees in each local authority.
Determining whether there is a relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority.
options(width=100)
#Installing necessary packages
install.packages("moments")
install.packages("car")
install.packages('tidyverse')
install.packages('gridExtra')
install.packages('grid')
install.packages("ggpubr")
install.packages("MASS")
install.packages("lmtest")
install.packages('lubridate')
install.packages('emmeans')
install.packages('kableExtra')
install.packages('Rmisc')
install.packages('Hmisc')
#importing libraries
library(car)
library(Rmisc)
library(kableExtra)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(emmeans)
library(dplyr)
library(moments)
library(grid)
library(ggpubr)
library(MASS)
library(lmtest)
This dataset is provided by the Food Standards Agency. Each row of the dataset contains details for a particular local authority. The variables in the dataset are described below.
| Original Variable Name | New Variable Name | Description |
|---|---|---|
| Country | Country | Country in which the Local Authority is in |
| LAType | LAType | Type of the local Authority |
| LAName | LAName | Name of the local authority. |
| Totalestablishments(includingnotyetrated&outside) | Total.establishments | Total number of establishments within the local authority (including those that were yet to be rated and those which are outside the rating programme) |
| Establishmentsnotyetratedforintervention | Establishments.not.yet.rated | Number of establishments that are yet to be rated for intervention but are part of the programme |
| Establishmentsoutsidetheprogramme | Establishments.outside.the.programme | Number of establishments that are outside the programme |
| Total%ofBroadlyCompliantestablishmentsratedA-E | perc.of.broadly.comp.rated.est | Percentage of Rated Establishments that are broadly compliant with the regulations out of the total number of Rated Establishments in that Local Authority. |
| Total%ofBroadlyCompliantestablishments(includingnotyetrated) | perc.of.broadly.comp.rated.and.unrated.est | Percentage of Rated Broadly Compliant Establishments out of total number of establishments that are part of the programme, including those which were not yet rated. |
| Aratedestablishments | A.rated.establishments | Number of Establishments that are rated A (greatest potential impact) in that Local Authority. |
| Total%ofBroadlyCompliantestablishments-A | perc.of.broadly.comp.A.rated.est | Percentage of establishments that are broadly compliant out of total number of establishments that were rated A in that Local Authority. |
| Bratedestablishments | B.rated.establishments | Number of establishments that are rated B in the Area |
| Total%ofBroadlyCompliantestablishments-B | perc.of.broadly.comp.B.rated.est | Percentage of establishments that are broadly compliant out of total number of establishments that were rated B in that Local Authority. |
| Cratedestablishments | C.rated.establishments | Number of establishments that are rated C in the Area |
| Total%ofBroadlyCompliantestablishments-C | perc.of.broadly.comp.C.rated.est | Percentage of establishments that are broadly compliant out of total number of establishments that were rated C in that Local Authority. |
| Dratedestablishments | D.rated.establishments | Number of establishments that are rated D in the Area |
| Total%ofBroadlyCompliantestablishments-D | perc.of.broadly.comp.D.rated.est | Percentage of D rated establishments that are broadly compliant out of total number of establishments that were rated D in that Local Authority. |
| Eratedestablishments | E.rated.establishments | Number of establishments that are rated E (least potential impact) in the Area |
| Total%ofBroadlyCompliantestablishments-E | perc.of.broadly.comp.E.rated.est | Percentage of E rated establishments that are broadly compliant out of total number of establishments that were rated E in that Local Authority. |
| Total%ofInterventionsachieved(premisesratedA-E) | perc.interventions.achieved.by.rated.est | Total Percentage of Interventions achieved by rated establishments out of the total number of interventions received by rated establishments |
| Total%ofInterventionsachieved-premisesratedA | perc.interventions.achieved.by.A.rated.est | Total Percentage of Interventions achieved by A rated establishments out of the total number of interventions received by A-rated establishments |
| Total%ofInterventionsachieved-premisesratedB | perc.interventions.achieved.by.B.rated.est | Total Percentage of Interventions achieved by B rated establishments out of the total number of interventions received by B-rated establishments |
| Total%ofInterventionsachieved-premisesratedC | perc.interventions.achieved.by.C.rated.est | Total Percentage of Interventions achieved by C rated establishments out of the total number of interventions received by C-rated establishments |
| Total%ofInterventionsachieved-premisesratedD | perc.interventions.achieved.by.D.rated.est | Total Percentage of Interventions achieved by D rated establishments out of the total number of interventions received by D-rated establishments. |
| Total%ofInterventionsachieved-premisesratedE | perc.interventions.achieved.by.E.rated.est | Total Percentage of Interventions achieved by E rated establishments out of the total number of interventions received by E-rated establishments |
| Total%ofInterventionsachieved-premisesnotyetrated | perc.interventions.achieved.by.unrated.est | Total Percentage of Interventions achieved by establishments which were not yet rated out of the total number of interventions received by establishments which were not yet rated. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Voluntaryclosure | est.subject.to.enforcement.Voluntary.closure | Total Number of Establishments that were subject to Voluntary closure as a formal enforcement action |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Seizure,detention&surrenderoffood | est.subject.to.enforcement.Seizure.detention.and.surrender.of.food | Total Number of Establishments that are subject to Seizures, Detentions or Surrender of food as formal enforcement actions. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Suspension/revocationofapprovalorlicence | est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence | Total Number of Establishments that are subject to Suspension or revocation of their approval licence as formal enforcement actions. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Hygieneemergencyprohibitionnotice | est.subject.to.enforcement.Hygiene.prohibition.notice | Total Number of Establishments that are subject to getting Hygiene Emergency Prohibition Notice as formal enforcement actions. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Prohibitionorder | est.subject.to.enforcement.Prohibition.order | Total Number of Establishments that are subject to getting a Prohibition Order as formal enforcement actions. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Simplecaution | est.subject.to.enforcement.Simple.caution | Total Number of Establishments that were subject to Simple Caution as formal enforcement actions. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Hygieneimprovementnotices | est.subject.to.enforcement.Hygiene.improvement.notices | Total Number of Establishments that are subject to Hygiene improvement notices as formal enforcement actions. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Remedialaction&detentionnotices | est.subject.to.enforcement.Remedial.action.and.detention.notices | Total Number of Establishments that are subject to remedial action and detention notices as formal enforcement actions. |
| TotalnumberofestablishmentssubjecttoWrittenwarnings | est.subject.to.written.warnings | Total Number of Establishments that were subject to Written Warnings. |
| Totalnumberofestablishmentssubjecttoformalenforcementactions-Prosecutionsconcluded | est.subject.to.enforcement.Prosecutions.concluded | Total Number of Establishments for which Prosecutions were concluded. |
| ProfessionalFullTimeEquivalentPosts-occupied | Number.of.FTE.food.safety.employees | Total Number of Professional Full Time Posts that are currently occupied in the Local Authority Office |
#Reading the data
dataset <- read_csv("2019-20-enforcement-data-food-hygiene.csv")
#Renaming the columns with the new names
names(dataset) <- c("Country", "LAType", "LAName", "Total.establishments", "Establishments.not.yet.rated", "Establishments.outside.the.programme", "perc.of.broadly.comp.rated.est", "perc.of.broadly.comp.rated.and.unrated.est", "A.rated.establishments", "perc.of.broadly.comp.A.rated.est", "B.rated.establishments", "perc.of.broadly.comp.B.rated.est", "C.rated.establishments", "perc.of.broadly.comp.C.rated.est","D.rated.establishments","perc.of.broadly.comp.D.rated.est","E.rated.establishments","perc.of.broadly.comp.E.rated.est", "perc.interventions.achieved.by.rated.est", "perc.interventions.achieved.by.A.rated.est","perc.interventions.achieved.by.B.rated.est","perc.interventions.achieved.by.C.rated.est","perc.interventions.achieved.by.D.rated.est","perc.interventions.achieved.by.E.rated.est", "perc.interventions.achieved.by.unrated.est", "est.subject.to.enforcement.Voluntary.closure", "est.subject.to.enforcement.Seizure.detention.and.surrender.of.food", "est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence", "est.subject.to.enforcement.Hygiene.prohibition.notice", "est.subject.to.enforcement.Prohibition.order", "est.subject.to.enforcement.Simple.caution", "est.subject.to.enforcement.Hygiene.improvement.notices", "est.subject.to.enforcement.Remedial.action.and.detention.notices", "est.subject.to.written.warnings", "est.subject.to.enforcement.Prosecutions.concluded", "Number.of.FTE.food.safety.employees" )
#Checking its Summary
summary(dataset)
## Country LAType LAName Total.establishments
## Length:353 Length:353 Length:353 Min. : 145.0
## Class :character Class :character Class :character 1st Qu.: 920.5
## Mode :character Mode :character Mode :character Median :1330.0
## Mean :1620.7
## 3rd Qu.:2004.5
## Max. :9277.0
## NA's :6
## Establishments.not.yet.rated Establishments.outside.the.programme perc.of.broadly.comp.rated.est
## Min. : 0.00 Min. : 0.00 Min. : 74.61
## 1st Qu.: 25.00 1st Qu.: 0.00 1st Qu.: 95.37
## Median : 49.00 Median : 2.00 Median : 97.13
## Mean : 89.75 Mean : 49.62 Mean : 96.33
## 3rd Qu.: 100.00 3rd Qu.: 39.00 3rd Qu.: 98.19
## Max. :1744.00 Max. :865.00 Max. :100.00
## NA's :6 NA's :6 NA's :6
## perc.of.broadly.comp.rated.and.unrated.est A.rated.establishments perc.of.broadly.comp.A.rated.est
## Min. :69.45 Min. : 0.000 Length:353
## 1st Qu.:89.23 1st Qu.: 1.000 Class :character
## Median :92.80 Median : 2.000 Mode :character
## Mean :91.54 Mean : 4.285
## 3rd Qu.:95.16 3rd Qu.: 5.000
## Max. :99.87 Max. :72.000
## NA's :6 NA's :6
## B.rated.establishments perc.of.broadly.comp.B.rated.est C.rated.establishments
## Min. : 2.00 Min. : 5.88 Min. : 18.0
## 1st Qu.: 23.00 1st Qu.: 55.62 1st Qu.: 144.0
## Median : 39.00 Median : 69.23 Median : 225.0
## Mean : 55.61 Mean : 67.34 Mean : 302.6
## 3rd Qu.: 68.50 3rd Qu.: 80.15 3rd Qu.: 376.0
## Max. :516.00 Max. :100.00 Max. :1647.0
## NA's :6 NA's :6 NA's :6
## perc.of.broadly.comp.C.rated.est D.rated.establishments perc.of.broadly.comp.D.rated.est
## Min. : 71.96 Min. : 56.0 Min. : 75.74
## 1st Qu.: 89.36 1st Qu.: 307.0 1st Qu.: 97.94
## Median : 92.86 Median : 462.0 Median : 99.03
## Mean : 92.02 Mean : 553.6 Mean : 98.38
## 3rd Qu.: 95.50 3rd Qu.: 682.0 3rd Qu.: 99.60
## Max. :100.00 Max. :3053.0 Max. :100.00
## NA's :6 NA's :6 NA's :6
## E.rated.establishments perc.of.broadly.comp.E.rated.est perc.interventions.achieved.by.rated.est
## Min. : 63.0 Min. : 79.22 Min. : 20.64
## 1st Qu.: 362.5 1st Qu.: 99.83 1st Qu.: 81.81
## Median : 480.0 Median :100.00 Median : 90.82
## Mean : 565.3 Mean : 99.82 Mean : 86.62
## 3rd Qu.: 667.5 3rd Qu.:100.00 3rd Qu.: 95.39
## Max. :4309.0 Max. :100.00 Max. :100.00
## NA's :6 NA's :6 NA's :6
## perc.interventions.achieved.by.A.rated.est perc.interventions.achieved.by.B.rated.est
## Length:353 Min. : 50.00
## Class :character 1st Qu.: 93.53
## Mode :character Median : 97.75
## Mean : 95.25
## 3rd Qu.:100.00
## Max. :100.00
## NA's :6
## perc.interventions.achieved.by.C.rated.est perc.interventions.achieved.by.D.rated.est
## Min. : 18.37 Min. : 19.77
## 1st Qu.: 89.27 1st Qu.: 82.00
## Median : 94.97 Median : 91.72
## Mean : 91.84 Mean : 86.32
## 3rd Qu.: 97.77 3rd Qu.: 95.98
## Max. :100.00 Max. :100.00
## NA's :6 NA's :6
## perc.interventions.achieved.by.E.rated.est perc.interventions.achieved.by.unrated.est
## Min. : 1.81 Min. : 6.56
## 1st Qu.: 65.39 1st Qu.: 85.81
## Median : 87.50 Median : 99.75
## Mean : 77.37 Mean : 90.95
## 3rd Qu.: 95.90 3rd Qu.:100.00
## Max. :100.00 Max. :100.00
## NA's :6 NA's :6
## est.subject.to.enforcement.Voluntary.closure
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 1.000
## Mean : 2.712
## 3rd Qu.: 3.000
## Max. :62.000
## NA's :6
## est.subject.to.enforcement.Seizure.detention.and.surrender.of.food
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 1.199
## 3rd Qu.: 1.000
## Max. :52.000
## NA's :6
## est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.06916
## 3rd Qu.:0.00000
## Max. :7.00000
## NA's :6
## est.subject.to.enforcement.Hygiene.prohibition.notice est.subject.to.enforcement.Prohibition.order
## Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0000 Median :0.0000
## Mean : 0.7118 Mean :0.1354
## 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :42.0000 Max. :6.0000
## NA's :6 NA's :6
## est.subject.to.enforcement.Simple.caution est.subject.to.enforcement.Hygiene.improvement.notices
## Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.000
## Median : 0.0000 Median : 4.000
## Mean : 0.4409 Mean : 7.527
## 3rd Qu.: 0.0000 3rd Qu.: 9.000
## Max. :14.0000 Max. :77.000
## NA's :6 NA's :6
## est.subject.to.enforcement.Remedial.action.and.detention.notices est.subject.to.written.warnings
## Min. :0.0000 Min. : 30.0
## 1st Qu.:0.0000 1st Qu.: 195.0
## Median :0.0000 Median : 340.0
## Mean :0.3314 Mean : 437.0
## 3rd Qu.:0.0000 3rd Qu.: 543.5
## Max. :9.0000 Max. :3061.0
## NA's :6 NA's :6
## est.subject.to.enforcement.Prosecutions.concluded Number.of.FTE.food.safety.employees
## Min. : 0.0000 Min. : 0.65
## 1st Qu.: 0.0000 1st Qu.: 2.50
## Median : 0.0000 Median : 3.41
## Mean : 0.6657 Mean : 4.10
## 3rd Qu.: 1.0000 3rd Qu.: 5.00
## Max. :25.0000 Max. :22.13
## NA's :6 NA's :6
# Null Value Inspection and Treatment
# Filtering rows that contain NA values. Note: each continuous column has 6 NA values according to the summary above.
filter(dataset, complete.cases(dataset) == FALSE)
## # A tibble: 6 × 36
## Country LAType LAName Total…¹ Estab…² Estab…³ perc.…⁴ perc.…⁵ A.rat…⁶ perc.…⁷ B.rat…⁸ perc.…⁹
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 England District C… Broxb… NA NA NA NA NA NA <NA> NA NA
## 2 England District C… Chorl… NA NA NA NA NA NA <NA> NA NA
## 3 England District C… East … NA NA NA NA NA NA <NA> NA NA
## 4 England District C… Hyndb… NA NA NA NA NA NA <NA> NA NA
## 5 England District C… Tamwo… NA NA NA NA NA NA <NA> NA NA
## 6 England Metropolit… West … NA NA NA NA NA NA <NA> NA NA
## # … with 24 more variables: C.rated.establishments <dbl>, perc.of.broadly.comp.C.rated.est <dbl>,
## # D.rated.establishments <dbl>, perc.of.broadly.comp.D.rated.est <dbl>,
## # E.rated.establishments <dbl>, perc.of.broadly.comp.E.rated.est <dbl>,
## # perc.interventions.achieved.by.rated.est <dbl>,
## # perc.interventions.achieved.by.A.rated.est <chr>,
## # perc.interventions.achieved.by.B.rated.est <dbl>,
## # perc.interventions.achieved.by.C.rated.est <dbl>, …
#Deleting those rows
dataset <- drop_na(dataset)
#Checking for duplicates
length(unique(dataset$LAName)) == nrow(dataset)
## [1] TRUE
#number of records
nrow(dataset)
## [1] 347
#Converting categorical columns to factors
categorical_columns <- c("Country", "LAType")
dataset[categorical_columns] <- lapply(dataset[categorical_columns], as.factor)
#Deleting Redundant variable LAName from dataset
dataset$LAName <- NULL
#Checking for any alphabetic values in the columns "perc.of.broadly.comp.A.rated.est" and "perc.interventions.achieved.by.A.rated.est"
characters_1 <- filter(dataset, grepl("^[A-Za-z]+$",dataset$perc.of.broadly.comp.A.rated.est))
levels(factor(characters_1$perc.of.broadly.comp.A.rated.est))
## [1] "NP"
characters_2 <- filter(dataset, grepl("^[A-Za-z]+$",dataset$perc.interventions.achieved.by.A.rated.est))
levels(factor(characters_2$perc.interventions.achieved.by.A.rated.est))
## [1] "NR"
#NP: No premises given at this risk rating
#NR: No interventions due or reported.
#Number of rows where the dataset contains NP or NR values
nrow(filter(dataset, perc.of.broadly.comp.A.rated.est == "NP" | perc.interventions.achieved.by.A.rated.est == "NR"))
## [1] 57
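An alternative to removing the NP/NR rows is to recode those placeholder codes to NA before numeric coercion, so the affected local authorities are excluded only from the analyses that use these columns. A minimal sketch on a hypothetical mini-column (made-up values, not the FSA data) using `dplyr::na_if`:

```r
library(dplyr)

# Hypothetical percentage column containing the FSA placeholder codes
toy <- tibble(perc = c("95.2", "NP", "88.1", "NR"))

# Recode "NP"/"NR" to NA, then coerce to numeric without warnings
toy <- toy %>%
  mutate(perc = as.numeric(na_if(na_if(perc, "NP"), "NR")))
```

The row-dropping approach used in this report is simpler and is kept below; the recode-to-NA route would preserve those authorities for the other variables.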
#Creating a separate tibble that excludes NR values, for visualizing perc.interventions.achieved.by.A.rated.est
visualization.tibble.for.A.rated <- dataset %>%
filter(perc.interventions.achieved.by.A.rated.est != "NR")
#Converting this column into numeric
visualization.tibble.for.A.rated$perc.interventions.achieved.by.A.rated.est <- as.numeric(visualization.tibble.for.A.rated$perc.interventions.achieved.by.A.rated.est)
#Visualizing the Distribution of the % of enforcement actions successfully achieved by all rated premises combined, across all Local Authorities
annotations <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.rated.est), 2)),y = c(0.08), label = c("Mean "))
ggplot(dataset, aes(perc.interventions.achieved.by.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "lightgrey") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby all premises', caption = "Figure 1. The distribution of enforcement actions successfully achieved across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)
#Visualizing the Distribution of the % of enforcement actions successfully achieved by A, B, C, D and E rated premises separately across all Local Authorities
#Creating Histogram for A rated establishments
annotations_A <- data.frame(x = c(round(mean(visualization.tibble.for.A.rated$perc.interventions.achieved.by.A.rated.est), 2)),y = c(0.68), label = c("Mean "))
Plot_A <- ggplot(visualization.tibble.for.A.rated, aes(perc.interventions.achieved.by.A.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "red") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby A-rated premises', caption = "Figure 2. The distribution of enforcement actions successfully achieved by A-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.A.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_A, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)
#Creating Histogram for B rated establishments
annotations_B <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.B.rated.est), 2)),y = c(0.4), label = c("Mean "))
Plot_B <- ggplot(dataset, aes(perc.interventions.achieved.by.B.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "orange") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby B-rated premises', caption = "Figure 3. The distribution of enforcement actions successfully achieved by B-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.B.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_B, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)
#Creating Histogram for C rated establishments
annotations_C <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.C.rated.est), 2)),y = c(0.15), label = c("Mean "))
Plot_c <- ggplot(dataset, aes(perc.interventions.achieved.by.C.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "#0099F8") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby C-rated premises', caption = "Figure 4. The distribution of enforcement actions successfully achieved by C-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.C.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_C, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)
#Creating Histogram for D rated establishments
annotations_D <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.D.rated.est), 2)),y = c(0.08), label = c("Mean "))
Plot_D <- ggplot(dataset, aes(perc.interventions.achieved.by.D.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "#FFFF00") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby D-rated premises', caption = "Figure 5. The distribution of enforcement actions successfully achieved by D-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.D.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_D, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)
#Creating Histogram for E rated establishments
annotations_E <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.E.rated.est), 2)),y = c(0.08), label = c("Mean "))
Plot_E <- ggplot(dataset, aes(perc.interventions.achieved.by.E.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "#90EE90") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby E-rated premises', caption = "Figure 6. The distribution of enforcement actions successfully achieved by E-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.E.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_E, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)
#Returning the plots (with nrow = 1, ncol = 1, ggarrange draws one plot per page and returns a list)
ggarrange(Plot_A, Plot_B, Plot_c, Plot_D, Plot_E, nrow = 1, ncol = 1)
## $`1`
##
## $`2`
##
## $`3`
##
## $`4`
##
## $`5`
##
## attr(,"class")
## [1] "list" "ggarrange"
#Deleting rows in the original dataset where NR or NP values are present
dataset <- dataset %>%
filter(perc.interventions.achieved.by.A.rated.est != "NR" & perc.of.broadly.comp.A.rated.est != "NP")
#Converting both of the above columns to numeric data type
columns <- c("perc.of.broadly.comp.A.rated.est", "perc.interventions.achieved.by.A.rated.est")
dataset[columns] <- lapply(dataset[columns], as.numeric)
#creating calculated columns necessary for further analysis
dataset <- dataset %>%
#no.of.rated.establishments.given.interventions: Number of rated establishments that are given interventions
mutate(no.of.rated.establishments.given.interventions = (Total.establishments - Establishments.not.yet.rated - Establishments.outside.the.programme) * (100 - perc.of.broadly.comp.rated.est)/100,
#no.of.unrated.establishments.given.interventions: Number of unrated establishments that are given interventions
no.of.unrated.establishments.given.interventions = ((Total.establishments - Establishments.outside.the.programme) * (100-perc.of.broadly.comp.rated.and.unrated.est)/100) - (no.of.rated.establishments.given.interventions),
#proportion.of.successful.responses: Number of rated and unrated establishments that successfully achieved the interventions divided by the total number of establishments that were given interventions. This ratio is expressed as a percentage.
proportion.of.successful.responses = (((no.of.rated.establishments.given.interventions*perc.interventions.achieved.by.rated.est/100) + (no.of.unrated.establishments.given.interventions*perc.interventions.achieved.by.unrated.est/100))/(no.of.unrated.establishments.given.interventions + no.of.rated.establishments.given.interventions)) * 100,
#total.no.of.establishments.given.interventions: Total Number of establishments that are given interventions
total.no.of.establishments.given.interventions = no.of.rated.establishments.given.interventions + no.of.unrated.establishments.given.interventions,
#no.of.enforcement.actions.per.type= Average Number of establishments that were given enforcement actions for every type of enforcement action.
no.of.enforcement.actions.per.type = (est.subject.to.enforcement.Voluntary.closure + est.subject.to.enforcement.Seizure.detention.and.surrender.of.food + est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence + est.subject.to.enforcement.Hygiene.prohibition.notice + est.subject.to.enforcement.Prohibition.order + est.subject.to.enforcement.Simple.caution + est.subject.to.enforcement.Hygiene.improvement.notices + est.subject.to.enforcement.Remedial.action.and.detention.notices + est.subject.to.written.warnings + est.subject.to.enforcement.Prosecutions.concluded)/10,
#Number.of.employees.per.establishment: Number of Employees allocated per establishment in a particular Local Authority
Number.of.employees.per.establishment = Number.of.FTE.food.safety.employees/(Total.establishments - Establishments.outside.the.programme))
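As a sanity check on the algebra in the `mutate()` above, a minimal sketch with made-up counts for one local authority (all numbers hypothetical, not real FSA figures) confirms that `proportion.of.successful.responses` reduces to the intervention-weighted average of the rated and unrated achievement percentages:

```r
# Hypothetical counts for a single local authority
total       <- 1000   # Total.establishments
not_rated   <- 100    # Establishments.not.yet.rated
outside     <- 50     # Establishments.outside.the.programme
comp_rated  <- 90     # perc.of.broadly.comp.rated.est
comp_all    <- 88     # perc.of.broadly.comp.rated.and.unrated.est
ach_rated   <- 95     # perc.interventions.achieved.by.rated.est
ach_unrated <- 80     # perc.interventions.achieved.by.unrated.est

# Same algebra as the mutate() pipeline above
rated_interv   <- (total - not_rated - outside) * (100 - comp_rated) / 100
unrated_interv <- (total - outside) * (100 - comp_all) / 100 - rated_interv
prop_success   <- (rated_interv * ach_rated / 100 +
                   unrated_interv * ach_unrated / 100) /
                  (rated_interv + unrated_interv) * 100

# Equivalent weighted-average form: weights are each group's share of interventions
weights    <- c(rated_interv, unrated_interv) / (rated_interv + unrated_interv)
prop_check <- sum(weights * c(ach_rated, ach_unrated))
```

Both forms agree, so the composite measure behaves as intended: it always lies between the rated and unrated achievement percentages.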
#Checking the distribution of both proportion of successful responses and Number of employees.
histogram_proportion <- ggplot(dataset, aes(proportion.of.successful.responses))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "yellow") + labs(x= 'proportion of successful responses',y='Density', title = 'Distribution', caption = "Figure 7(a). Histogram for proportion of successful responses") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=60, y=0.05, label= sprintf("Skewness: %s", round(skewness(dataset$proportion.of.successful.responses), 2)))
qqplot_proportion <- ggplot(dataset, aes(sample=proportion.of.successful.responses)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantities',y='Sample Quantities', title = 'Q-Q Plot', caption = "Figure 7(b). Q-Q Plot for proportion of successful reponses") + theme(
plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size=7)
)
histogram_employees <- ggplot(dataset, aes(Number.of.FTE.food.safety.employees))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees',y='Density', title = 'Distribution', caption = "Figure 8(a). Histogram for number of FTE food safety employees") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=15, y=0.1, label= sprintf("Skewness: %s", round(skewness(dataset$Number.of.FTE.food.safety.employees), 2)))
qqplot_employees<- ggplot(dataset, aes(sample=Number.of.FTE.food.safety.employees)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantities',y='Sample Quantities', title = 'Q-Q Plot', caption = "Figure 8(b). Q-Q Plot for number of FTE food safety employees") + theme(
plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
)
grid.arrange(histogram_proportion, qqplot_proportion, histogram_employees, qqplot_employees, ncol = 2, nrow = 2, top=textGrob("Distribution of Proportion of Successful Responses and Number of FTE employees", gp=gpar(fontsize=12, fontface= 2)))
#Kolmogorov-Smirnov test for checking normality
ks.test(dataset$proportion.of.successful.responses, "pnorm", mean = mean(dataset$proportion.of.successful.responses), sd= sd(dataset$proportion.of.successful.responses))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: dataset$proportion.of.successful.responses
## D = 0.17467, p-value = 4.129e-08
## alternative hypothesis: two-sided
ks.test(dataset$Number.of.FTE.food.safety.employees, "pnorm", mean= mean(dataset$Number.of.FTE.food.safety.employees), sd = sd(dataset$Number.of.FTE.food.safety.employees))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: dataset$Number.of.FTE.food.safety.employees
## D = 0.15026, p-value = 4.113e-06
## alternative hypothesis: two-sided
#Since the p-values are < 0.05, we reject the null hypothesis of normality and conclude that the distributions these samples came from are not normal.
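One caveat worth flagging: estimating the mean and sd from the same sample makes the one-sample Kolmogorov-Smirnov test conservative (the Lilliefors test corrects for this). A minimal illustration on simulated data (not the FSA dataset):

```r
# Simulated illustration: KS test against a normal whose mean and sd
# are estimated from the same sample -- p-values are biased upwards
set.seed(7)
z <- rnorm(200)
ks <- ks.test(z, "pnorm", mean = mean(z), sd = sd(z))
ks$statistic  # D statistic
ks$p.value
```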
#Running spearman correlation test
cor.test(dataset$proportion.of.successful.responses, dataset$Number.of.FTE.food.safety.employees, method= "spearman")
##
## Spearman's rank correlation rho
##
## data: dataset$proportion.of.successful.responses and dataset$Number.of.FTE.food.safety.employees
## S = 4490871, p-value = 0.0747
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1048238
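Spearman's rho is equivalent to Pearson's correlation computed on the ranks of the data; a quick sketch on hypothetical toy vectors confirms this:

```r
# Toy vectors (illustrative values only)
x <- c(3, 1, 4, 1.5, 9)
y <- c(2, 7, 1, 8, 2.8)
rho_direct <- cor(x, y, method = "spearman")
rho_ranks  <- cor(rank(x), rank(y))  # Pearson correlation of the ranks
```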
#Visualizing scatterplot between both of these variables, with a 95% CI best fit line
ggplot(dataset, aes(x= Number.of.FTE.food.safety.employees, y=proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees", caption = "Figure 9. Scatter Plot between Proportion of Successful Responses and Number of FTE employees", color = "legend") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25))
#linear model
test.linear.model <- lm(proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees, data= dataset)
## Linear Regression Diagnostic Plots:
# Residuals vs Fitted Values Plot
residuals_test <- residuals(test.linear.model)
fitted_values_test <- fitted(test.linear.model)
data <- data.frame(residuals_test, fitted_values_test)
residuals.vs.fit.plot_test <- ggplot(data, aes(x = fitted_values_test, y = residuals_test)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(y = "Residuals", x = "Fitted Values", title = "Residuals vs Fitted Values Plot", caption = "Figure 10(a). Residual vs Fitted Values Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5))
#QQ plot for residuals
qq.plot.residuals_test <- ggplot(data, aes(sample=residuals_test)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 10(b). Q-Q Plot for errors") + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
)
#Scale-Location Plot (square root of absolute standardized residuals against fitted values)
sqrt_abs_std_resid_test <- sqrt(abs(rstandard(test.linear.model)))
Scale.location.plot_test <- ggplot(data, aes(x = fitted_values_test, y = sqrt_abs_std_resid_test)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(x= "Fitted Values",y="Sqrt of |Standardized Residuals|", title = "Scale-Location Plot", caption = "Figure 10(c). Scale Location Plot") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
#Cook's Distance Plot
cooks_distance_test <- cooks.distance(test.linear.model)
data.cooks_test <- data.frame(observation = 1:length(cooks_distance_test), cooks_distance_test)
Cooks.distance.plot_test <- ggplot(data.cooks_test, aes(x = observation, y = cooks_distance_test, color = cooks_distance_test)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") + labs(x= "Observation",y="Cook's Distance", title = "Cook's Distance Plot", caption = "Figure 10(d). Cook's Distance Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
#Residual vs Leverage Plot
leverage_test <- hatvalues(test.linear.model)
data.rl_test <- data.frame(residuals_test, leverage_test, observation = 1:length(leverage_test))
residual.vs.leverage.plot_test <- ggplot(data.rl_test, aes(x = leverage_test, y = residuals_test, color = residuals_test)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") + labs(x= "Leverage",y="Residuals", title = "Residual vs Leverage Plot", caption = "Figure 10(e). Residual vs Leverage Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
# Cook's Distance vs Leverage Plot (Cook's distances were already computed above)
data.cdl_test <- data.frame(cooks_distance_test, leverage_test, observation = 1:length(leverage_test))
Cooks.dist.vs.leverage.plot_test <- ggplot(data.cdl_test, aes(x = leverage_test, y = cooks_distance_test, color = cooks_distance_test)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") + labs(x= "Leverage",y="Cook's Distance", title = "Cook's Distance vs Leverage Plot", caption = "Figure 10(f). Cook's Distance vs Leverage Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
ggarrange(residuals.vs.fit.plot_test, qq.plot.residuals_test, Scale.location.plot_test, Cooks.distance.plot_test, residual.vs.leverage.plot_test, Cooks.dist.vs.leverage.plot_test, ncol = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
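As an aid to reading the Cook's distance plot, a common rule of thumb flags observations with Cook's distance above 4/n as influential. A minimal sketch using R's built-in cars data (illustrative only, not the FSA dataset):

```r
# Flag influential points with Cook's distance > 4/n (rule-of-thumb cutoff)
fit <- lm(dist ~ speed, data = cars)
cd  <- cooks.distance(fit)
influential <- which(cd > 4 / nrow(cars))
```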
# Statistical tests to check whether the model assumptions are violated
#Shapiro-Wilk Normality Test
sresid <- studres(test.linear.model)
shapiro.test(sresid)
##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.81682, p-value < 2.2e-16
#studentized Breusch-Pagan test to check homoskedasticity of residuals
lmtest::bptest(test.linear.model)
##
## studentized Breusch-Pagan test
##
## data: test.linear.model
## BP = 0.0086749, df = 1, p-value = 0.9258
#Durbin-Watson test to check if residuals are autocorrelated
durbinWatsonTest(test.linear.model)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.03755358 2.072879 0.526
## Alternative hypothesis: rho != 0
# The residuals are not normally distributed (Shapiro-Wilk p < 0.05), so the linear regression results will not be reliable. (The Breusch-Pagan test, by contrast, gives no evidence of heteroskedasticity.)
#In such cases, transforming the variables towards normality, with subsequent outlier treatment, often solves the problem
# Creating a function that transforms the variable using the transformation technique that makes the variable the most normally distributed.
transformation <- function(x, name="", returnvector = TRUE) {
#if returnvector = FALSE, the transformed variable will be returned as a tibble having the column name specifying what transformation took place.
#if returnvector = TRUE, the transformed variable will be returned as a vector
#name is the prefix of the column name of the transformed variable that will be returned as a tibble when returnvector = FALSE
if (is.numeric(x) == TRUE){
if (skewness(x) >= 0) {
# Create a list of potential transformations
transformations <- list(log = log(x + 1), sqrt = sqrt(x), reciprocal = 1/(x + 1),
square = x^2, cube = x^3, original = x)
# Check the normality of each transformation using the Shapiro-Wilk test
p_values <- sapply(transformations, function(f) {
shapiro.test(f)$p.value
})
# Identify the transformation with the highest p-value (indicating the most normal distribution)
best_transformation <- names(which.max(p_values))
# Return the transformed variable
#creating a tibble of the transformed variable
tibblevector <- tibble(transformations[[best_transformation]])
names(tibblevector) <- c(paste(name,best_transformation))
if (returnvector == FALSE){
return (tibblevector)
}
else{
return (pull(tibblevector, 1))
}
} else if (skewness(x) < 0) {
#if skewness is less than 0, these transformations will work best when the variable is reflected.
#creating the reflected variable using the following formula.
y <- max(x) + 1 - x
# Create a list of potential transformations
transformations <- list(reflected.log = log(y + 1), reflected.sqrt = sqrt(y), reflected.reciprocal = 1/(y + 1),
reflected.square = y^2, reflected.original = y, reflected.cube = y^3, original = x, sqrt = sqrt(x), reciprocal = 1/(x+1), log = log(x + 1))
# Check the normality of each transformation using the Shapiro-Wilk test
p_values <- sapply(transformations, function(f) {
shapiro.test(f)$p.value
})
# Identify the transformation with the highest p-value (indicating the most normal distribution)
best_transformation <- names(which.max(p_values))
# Creating a tibble of the transformed variable (note: reflected transformations reverse the original ordering)
tibblevector <- tibble(transformations[[best_transformation]])
names(tibblevector) <- c(paste(name,best_transformation))
if (returnvector == FALSE){
return (tibblevector)
}
else{
return (pull(tibblevector, 1))
}
}
} else {
# if the variable is not numeric, simply return the variable.
return(x)
}
}
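The core of the selection logic above — trying candidate transformations and keeping the one with the highest Shapiro-Wilk p-value — can be sketched in base R on a simulated right-skewed sample (toy data, not the FSA dataset):

```r
# Simulated right-skewed sample
set.seed(1)
x <- rexp(100)
# Candidate transformations, as in the function above
cands <- list(log = log(x + 1), sqrt = sqrt(x),
              reciprocal = 1 / (x + 1), original = x)
# Shapiro-Wilk p-value for each candidate; keep the most normal-looking one
p_values <- sapply(cands, function(v) shapiro.test(v)$p.value)
best <- names(which.max(p_values))
```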
#Creating another function, which returns, for each column name, the transformation that was applied
transformation.info <- function(data=NA){
list_of_transformations <- list()
for (column in names(data)){
column_vector <- pull(data, column)
transformed <- transformation(column_vector, returnvector = FALSE)
if (is.vector(transformed) == FALSE){
list_of_transformations[[column]] <- names(transformed)
}
}
return (list_of_transformations)
}
#Applying the transformation function to both of these variables
transformed.proportion.of.successful.responses <- transformation(dataset$proportion.of.successful.responses, returnvector = FALSE)
names(transformed.proportion.of.successful.responses)
## [1] " reflected.log"
transformed.Number.of.FTE.employees <- transformation(dataset$Number.of.FTE.food.safety.employees, returnvector = FALSE)
names(transformed.Number.of.FTE.employees)
## [1] " log"
#proportion.of.successful.responses was transformed using the reflected log technique
#Number of FTE employees was transformed using the log technique.
#Note: the higher the Number of FTE employees, the higher its log.
#Note: reflection reverses the ordering, so the higher the proportion of successful responses, the LOWER its reflected log.
#Hence, a positive correlation between the transformed variables corresponds to a NEGATIVE correlation between the original variables, and the sign must be flipped when interpreting results.
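A quick numeric check of the reflected-log mapping on toy values: reflection reverses the ordering, so the transformed and original values are perfectly negatively rank-correlated:

```r
# Toy values: reflect then log, exactly as in the transformation function
x <- c(10, 50, 90)
reflected.log <- log(max(x) + 1 - x + 1)
# Rank correlation between original and reflected-log values is -1
cor(x, reflected.log, method = "spearman")
```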
#Outlier Inspection
#Visualizing Box Plots for both proportion of successful response and Number of FTE employees
visualization.tibble <- tibble(transformed.proportion.of.successful.responses = pull(transformed.proportion.of.successful.responses, 1), transformed.Number.of.FTE.employees = pull(transformed.Number.of.FTE.employees,1))
boxplot1 <- ggplot(visualization.tibble, aes(y = transformed.Number.of.FTE.employees)) +
geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
labs(title = "No. of FTE employees", y = "", caption= "Figure 11(a). Box Plot for No of Employees")+
theme_classic() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25))
boxplot2 <- ggplot(visualization.tibble, aes(y = transformed.proportion.of.successful.responses)) +
geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
labs(title = "Proportion of successful responses", y = "", caption = "Figure 11(b). Box Plot for proportion of successful responses")+
theme_classic() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25))
grid.arrange(boxplot1, boxplot2, ncol = 2, top=textGrob("Box Plots", gp=gpar(fontsize=12, fontface= 2)))
#Outlier Treatment
#Creating a function to treat outliers in a numerical vector using IQR fences
outlier.treatment <- function(x, how= "mean"){
#if how = 'mean', the outliers will be replaced by the mean.
#However if how = 'maxmin', outliers will be replaced by the corresponding upper and lower IQR bounds
#Note: the fences used here are q1 - IQR and q3 + IQR (a 1x IQR rule, stricter than the usual Tukey 1.5x IQR fences)
if (is.numeric(x) == TRUE){
# storing P75 Value
q3 <- quantile(x, 0.75)
#storing P25 value
q1 <- quantile(x, 0.25)
#Storing IQR Value
iqr <- IQR(x)
if (how == "mean"){
#calculating mean after removing outliers
.mean <- mean(x[x <= q3 + iqr & x >= q1 - iqr])
#replacing upper outliers with the mean
x[x > q3 + iqr] <- .mean
#replacing lower outliers with the mean
x[x < q1 - iqr] <- .mean
#returning the treated vector
return(x)
} else if (how == "maxmin"){
#replacing upper outliers with IQR upper bound
x[x > q3 + iqr] <- q3 + iqr
#replacing lower outliers with IQR lower bound
x[x < q1 - iqr] <- q1 - iqr
return(x)
}
}
else{
#if vector is not numerical, simply return the vector
return(x)
}
}
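The 'maxmin' branch above can be expressed more compactly with pmin/pmax; a base-R sketch on simulated data, using the same 1x IQR fences as the function:

```r
# Simulated vector with one clear upper outlier (toy data)
set.seed(42)
v <- c(rnorm(50), 10)
q1 <- quantile(v, 0.25); q3 <- quantile(v, 0.75); iqr <- IQR(v)
# 'maxmin'-style capping at the 1x IQR fences
capped <- pmin(pmax(v, q1 - iqr), q3 + iqr)
```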
#Applying the outlier.treatment function to our variables under study
visualization.tibble$transformed.Number.of.FTE.employees <- outlier.treatment(visualization.tibble$transformed.Number.of.FTE.employees)
visualization.tibble$transformed.proportion.of.successful.responses <- outlier.treatment(visualization.tibble$transformed.proportion.of.successful.responses)
#Transforming and subsequently treating outliers for all variables of the dataset and creating a new data set, for future analysis.
transformed.dataset <- tibble(as.data.frame(lapply(dataset, transformation)))
transformed.dataset <- tibble(as.data.frame(lapply(transformed.dataset, outlier.treatment)))
# Returning what kind of transformation was done for each of the variables using UDF
list.of.transformation.info <- transformation.info(data= dataset)
#Storing it as a kable.
table1a <- kable(tibble(Variables = names(list.of.transformation.info), `Transformation applied`= t(as.data.frame(list.of.transformation.info))[,1]), caption = "Table 1. List of variables along with their transformations")
vectorb <- c("log" = "Values were replaced by their natural log", "sqrt" = "Values were replaced by their square roots", "reciprocal" = "Values were replaced by their reciprocals", "square" = "Values were replaced by their squared values", "cube" = "Values were replaced by their cubed values", "original" = "All values were kept original", "reflected.original" = "Each value was subtracted from one plus the maximum of the variable to mirror the distribution", "reflected.log" = "Each value was subtracted from one plus the maximum of the variable to mirror the distribution, and the natural log was then applied to the result", "reflected.sqrt" = "Each value was subtracted from one plus the maximum of the variable to mirror the distribution, and the square root was then applied to the result", "reflected.reciprocal" = "Each value was subtracted from one plus the maximum of the variable to mirror the distribution, and the reciprocal of the result was then taken", "reflected.square" = "Each value was subtracted from one plus the maximum of the variable to mirror the distribution, and the result was squared", "reflected.cube" = "Each value was subtracted from one plus the maximum of the variable to mirror the distribution, and the result was cubed")
table1b <- kable(tibble(`Transformation applied` = names(vectorb), `Meaning of the transformation` = vectorb), caption = "Table 1(b). Explanation of all transformations")
#Visualizing the scatter plot between these transformed variables
transformed.scatter.plot.between.p.and.n <- ggplot(transformed.dataset, aes(x= Number.of.FTE.food.safety.employees , y= proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees after transformation", caption = "Figure 12. Scatter Plot between Proportion of Successful Responses and Number of FTE employees after transformation") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25))
transformed.scatter.plot.between.p.and.n
qqplot_transformed_proportion <- ggplot(transformed.dataset, aes(sample=proportion.of.successful.responses)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 13(b). Q-Q Plot for proportion of successful responses \n(after transformation)") + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
)
histogram_transformed_proportion <- ggplot(transformed.dataset, aes(proportion.of.successful.responses))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'proportion of successful responses',y='Density', title = 'Distribution', caption = "Figure 13(a). Histogram for proportion of successful responses \n(after transformation)") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=3.75, y=0.6, label= sprintf("Skewness: %s", round(skewness(transformed.dataset$proportion.of.successful.responses), 2)), size = 3)
histogram_transformed_employees <- ggplot(transformed.dataset, aes(Number.of.FTE.food.safety.employees))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees',y='Density', title = 'Distribution', caption = "Figure 14(a). Histogram for number of FTE food safety employees \n(after transformation)") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=1.2, y=3, label= sprintf("Skewness: %s", round(skewness(transformed.dataset$Number.of.FTE.food.safety.employees), 2)), size= 3)
qqplot_transformed_employees<- ggplot(transformed.dataset, aes(sample= Number.of.FTE.food.safety.employees)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 14(b). Q-Q Plot for number of FTE food safety employees \n(after transformation)") + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
)
grid.arrange(histogram_transformed_proportion, qqplot_transformed_proportion, histogram_transformed_employees, qqplot_transformed_employees, ncol = 2, nrow = 2, top=textGrob("Distribution of Proportion of Successful Responses and Number of FTE employees \nafter transformation", gp=gpar(fontsize=12, fontface= 2)))
#testing for normality using Kolmogorov-Smirnov test
ks.test(transformed.dataset$proportion.of.successful.responses, "pnorm", mean=mean(transformed.dataset$proportion.of.successful.responses), sd= sd(transformed.dataset$proportion.of.successful.responses))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: transformed.dataset$proportion.of.successful.responses
## D = 0.075553, p-value = 0.07297
## alternative hypothesis: two-sided
ks.test(transformed.dataset$Number.of.FTE.food.safety.employees, "pnorm", mean=mean(transformed.dataset$Number.of.FTE.food.safety.employees), sd=sd(transformed.dataset$Number.of.FTE.food.safety.employees))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: transformed.dataset$Number.of.FTE.food.safety.employees
## D = 0.077851, p-value = 0.05948
## alternative hypothesis: two-sided
#Since these are continuous variables that are now roughly normally distributed, we run a Pearson correlation test
cor.test(y = transformed.dataset$proportion.of.successful.responses, x = transformed.dataset$Number.of.FTE.food.safety.employees, method= "pearson")
##
## Pearson's product-moment correlation
##
## data: transformed.dataset$Number.of.FTE.food.safety.employees and transformed.dataset$proportion.of.successful.responses
## t = 1.0152, df = 288, p-value = 0.3108
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0558476 0.1737010
## sample estimates:
## cor
## 0.05971611
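For reference, the t statistic reported by cor.test is a deterministic function of r and n; a quick check on R's built-in cars data (illustrative only, not the FSA dataset):

```r
# cor.test's t statistic is r * sqrt(n - 2) / sqrt(1 - r^2)
r <- cor(cars$speed, cars$dist)
n <- nrow(cars)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
```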
#Creating simple linear regression model between both of the variables
simple.linear.model <- lm(proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees, data= transformed.dataset)
summary(simple.linear.model)
##
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees,
## data = transformed.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61006 -0.72858 -0.01466 0.68214 1.92440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9952 0.2471 8.075 1.85e-14 ***
## Number.of.FTE.food.safety.employees 0.1566 0.1542 1.015 0.311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8472 on 288 degrees of freedom
## Multiple R-squared: 0.003566, Adjusted R-squared: 0.0001062
## F-statistic: 1.031 on 1 and 288 DF, p-value: 0.3108
#Estimation approach: coefficients with their 95% confidence intervals
cbind(coefficient = coef(simple.linear.model), confint(simple.linear.model))
## coefficient 2.5 % 97.5 %
## (Intercept) 1.9952235 1.5089203 2.4815267
## Number.of.FTE.food.safety.employees 0.1565916 -0.1469945 0.4601778
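The confidence intervals above come from the usual coefficient ± t-critical × standard-error formula; this can be verified by hand (sketched here on R's built-in cars data, not the FSA dataset):

```r
# Hand-computed 95% CIs for an lm fit
fit    <- lm(dist ~ speed, data = cars)
se     <- summary(fit)$coefficients[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
manual_ci <- cbind(coef(fit) - t_crit * se, coef(fit) + t_crit * se)
```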
#Creating a baseline model
baseline.model <- lm(proportion.of.successful.responses ~ 1, data = transformed.dataset)
summary(baseline.model)
##
## Call:
## lm(formula = proportion.of.successful.responses ~ 1, data = transformed.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54778 -0.70677 -0.01729 0.69574 1.91070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.24092 0.04975 45.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8473 on 289 degrees of freedom
#Comparing model with baseline model
anova(baseline.model, simple.linear.model)
## Analysis of Variance Table
##
## Model 1: proportion.of.successful.responses ~ 1
## Model 2: proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 289 207.47
## 2 288 206.73 1 0.73984 1.0307 0.3108
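When a single predictor is added, the F statistic from this nested-model comparison equals the square of that predictor's t value; a quick check on R's built-in cars data (illustrative only):

```r
# Nested-model F equals squared t for one added predictor
fit0 <- lm(dist ~ 1, data = cars)
fit1 <- lm(dist ~ speed, data = cars)
a <- anova(fit0, fit1)
t_val <- summary(fit1)$coefficients["speed", "t value"]
```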
## Linear Regression Diagnostic Plots:
# Residuals vs Fitted Values Plot
residuals <- residuals(simple.linear.model)
fitted_values <- fitted(simple.linear.model)
data <- data.frame(residuals, fitted_values)
residuals.vs.fit.plot <- ggplot(data, aes(x = fitted_values, y = residuals)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(y = "Residuals", x = "Fitted Values", title = "Residuals vs Fitted Values Plot", caption = "Figure 15(a). Residual vs Fitted Values Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5))
#QQ plot for residuals
qq.plot.residuals <- ggplot(data, aes(sample=residuals)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 15(b). Q-Q Plot for errors") + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
)
#Scale-Location Plot (square root of absolute standardized residuals against fitted values)
sqrt_abs_std_resid <- sqrt(abs(rstandard(simple.linear.model)))
Scale.location.plot <- ggplot(data, aes(x = fitted_values, y = sqrt_abs_std_resid)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(x= "Fitted Values",y="Sqrt of |Standardized Residuals|", title = "Scale-Location Plot", caption = "Figure 15(c). Scale Location Plot") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
#Cook's Distance Plot
cooks_distance <- cooks.distance(simple.linear.model)
data.cooks <- data.frame(observation = 1:length(cooks_distance), cooks_distance)
Cooks.distance.plot <- ggplot(data.cooks, aes(x = observation, y = cooks_distance, color = cooks_distance)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") + labs(x= "Observation",y="Cook's Distance", title = "Cook's Distance Plot", caption = "Figure 15(d). Cook's Distance Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
#Residual vs Leverage Plot
leverage <- hatvalues(simple.linear.model)
data.rl <- data.frame(residuals, leverage, observation = 1:length(leverage))
residual.vs.leverage.plot <- ggplot(data.rl, aes(x = leverage, y = residuals, color = residuals)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") + labs(x= "Leverage",y="Residuals", title = "Residual vs Leverage Plot", caption = "Figure 15(e). Residual vs Leverage Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
# Cook's Distance vs Leverage Plot (Cook's distances were already computed above)
data.cdl <- data.frame(cooks_distance, leverage, observation = 1:length(leverage))
Cooks.dist.vs.leverage.plot <- ggplot(data.cdl, aes(x = leverage, y = cooks_distance, color = cooks_distance)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") + labs(x= "Leverage",y="Cook's Distance", title = "Cook's Distance vs Leverage Plot", caption = "Figure 15(f). Cook's Distance vs Leverage Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
ggarrange(residuals.vs.fit.plot, qq.plot.residuals, Scale.location.plot, Cooks.distance.plot, residual.vs.leverage.plot, Cooks.dist.vs.leverage.plot, ncol = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
# Statistical tests to check whether the model assumptions are violated
#Shapiro-Wilk Normality Test
sresid <- studres(simple.linear.model)
shapiro.test(sresid)
##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.97443, p-value = 4.757e-05
#studentized Breusch-Pagan test to check homoskedasticity of residuals
lmtest::bptest(simple.linear.model)
##
## studentized Breusch-Pagan test
##
## data: simple.linear.model
## BP = 3.8864, df = 1, p-value = 0.04868
#Durbin-Watson test to check if residuals are autocorrelated
durbinWatsonTest(simple.linear.model)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.01787908 2.032779 0.806
## Alternative hypothesis: rho != 0
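The studentized (Koenker) Breusch-Pagan statistic can also be computed by hand as n × R² from an auxiliary regression of the squared residuals on the predictors; a sketch on R's built-in cars data (illustrative only, not the FSA dataset):

```r
# Hand-computed studentized Breusch-Pagan statistic
fit <- lm(dist ~ speed, data = cars)
u2  <- residuals(fit)^2
aux <- lm(u2 ~ speed, data = cars)   # auxiliary regression
BP  <- nrow(cars) * summary(aux)$r.squared   # ~ chi-squared(df = 1) under H0
p_val <- pchisq(BP, df = 1, lower.tail = FALSE)
```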
#Below is a user-defined function implementing forward selection: starting from the simple model, it adds the variables from the dataset that best enhance the model's fit, subject to VIF and significance checks.
forward.selection <- function(model1, variableunderstudy, vector.of.variables.to.try, interaction = FALSE, vector.of.variables.for.interaction = NA){
# Define the variables to include in the model
best.model <- model1
linear.model.1.fs <- model1
# Perform forward selection
while (TRUE){
vector.of.terms <- c()
vector.of.F.values <- c()
for (var in vector.of.variables.to.try) {
if (!var%in%names(linear.model.1.fs$model)){
temp_model <- update(linear.model.1.fs, paste0("~. + ",var))
if (any(is.na(temp_model$coefficients)) == FALSE){
vif_value <- vif(temp_model)[variableunderstudy]
if (is.na(vif_value) == TRUE){
vif_value <- vif(temp_model)[variableunderstudy,]["GVIF"]
}
f_value <- anova(linear.model.1.fs, temp_model)[5][2,]
p_value <- anova(linear.model.1.fs, temp_model)[6][2,]
best_model_fvalue <- anova(linear.model.1.fs, best.model)[5][2,]
if (is.na(best_model_fvalue)){
best_model_fvalue <- 0
}
if (vif_value < 5 && p_value < 0.05 && f_value >= 0 && f_value >= best_model_fvalue) {
vector.of.terms <-append(vector.of.terms, var)
vector.of.F.values <- append(vector.of.F.values, f_value)
}
}
}
}
if (length(vector.of.terms) > 0){
best.model <- update(linear.model.1.fs, paste0("~. + ",vector.of.terms[match(max(vector.of.F.values), vector.of.F.values)]))
linear.model.1.fs <- best.model
}
else{
break
}
}
if (interaction == FALSE){
return (linear.model.1.fs)
}
else{
linear.model.1.fs.withinteraction <- linear.model.1.fs
vector.of.interaction.terms <- c()
vector.of.F.values.interaction <- c()
for (var in vector.of.variables.for.interaction) {
temp_model <- update(linear.model.1.fs, paste0("~. + ",paste0(paste0(variableunderstudy, "*"),var)))
if (any(is.na(temp_model$coefficients)) == FALSE){
f_value <- anova(linear.model.1.fs, temp_model)[5][2,]
p_value <- anova(linear.model.1.fs, temp_model)[6][2,]
best_model_fvalue <- anova(linear.model.1.fs, best.model)[5][2,]
if (is.na(best_model_fvalue)){
best_model_fvalue <- 0
}
if (p_value < 0.05 && f_value > 0 && f_value > best_model_fvalue) {
vector.of.interaction.terms <-append(vector.of.interaction.terms, var)
vector.of.F.values.interaction <- append(vector.of.F.values.interaction, f_value)
}
}
}
if (length(vector.of.interaction.terms) > 0){
best.model <- update(linear.model.1.fs, paste0("~. + ",vector.of.interaction.terms[match(max(vector.of.F.values.interaction), vector.of.F.values.interaction)]))
linear.model.1.fs.withinteraction <- best.model
}
return (linear.model.1.fs.withinteraction)
}
}
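Base R's add1() performs a single step of the same idea (without the VIF screening used above); a minimal sketch on the built-in mtcars data rather than the FSA dataset:

```r
# One step of forward selection with base R's add1()
base_fit <- lm(mpg ~ wt, data = mtcars)
step1 <- add1(base_fit, scope = ~ wt + hp + qsec, test = "F")
step1  # F tests for each candidate addition
```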
#Applying the UDF to the simple model to build the best multiple linear regression model
mainvariable <- "Number.of.FTE.food.safety.employees"
vars <- colnames(subset(transformed.dataset, select = c(Country, LAType, Total.establishments, Establishments.not.yet.rated, Establishments.outside.the.programme, perc.of.broadly.comp.rated.est, perc.of.broadly.comp.rated.and.unrated.est, A.rated.establishments, perc.of.broadly.comp.A.rated.est, B.rated.establishments, perc.of.broadly.comp.B.rated.est, C.rated.establishments, perc.of.broadly.comp.C.rated.est, D.rated.establishments, perc.of.broadly.comp.D.rated.est, E.rated.establishments, perc.of.broadly.comp.E.rated.est)))
interaction.terms <- c("total.no.of.establishments.given.interventions", "perc.of.broadly.comp.rated.and.unrated.est", "no.of.enforcement.actions.per.type", "Country", "LAType")
#Applying the UDF and creating the model, storing it in multiple.linear.model variable
multiple.linear.model <- forward.selection(model1 = simple.linear.model, variableunderstudy = mainvariable, vector.of.variables.to.try = vars, interaction = TRUE, vector.of.variables.for.interaction = interaction.terms)
#Summary output of the best multiple linear regression model
summary(multiple.linear.model)
##
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees +
## Establishments.not.yet.rated + E.rated.establishments + perc.of.broadly.comp.rated.and.unrated.est,
## data = transformed.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.56580 -0.60396 -0.03199 0.58447 2.01252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.88846 0.90025 -2.098 0.03681 *
## Number.of.FTE.food.safety.employees -0.33870 0.17303 -1.957 0.05127 .
## Establishments.not.yet.rated 0.15697 0.08013 1.959 0.05108 .
## E.rated.establishments 0.48991 0.16171 3.030 0.00267 **
## perc.of.broadly.comp.rated.and.unrated.est 0.44237 0.14678 3.014 0.00281 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7817 on 285 degrees of freedom
## Multiple R-squared: 0.1606, Adjusted R-squared: 0.1489
## F-statistic: 13.64 on 4 and 285 DF, p-value: 3.479e-10
# The p value for Number.of.FTE.food.safety.employees is not less than 0.05, indicating that the relationship between Number of FTE employees and proportion of successful responses is not significant.
# Visualizing the distribution of variables
histogram_employees_per_establishment <- ggplot(dataset, aes(Number.of.employees.per.establishment))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees per establishment',y='Density', title = 'Distribution', caption = "Figure 16(a). Histogram for number of FTE food safety employees per establishment") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=0.0075, y=200, label= sprintf("Skewness: %s", round(skewness(dataset$Number.of.employees.per.establishment), 2)))
qqplot_employees_per_establishment<- ggplot(dataset, aes(sample=Number.of.employees.per.establishment)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 16(b). Q-Q Plot for number of FTE food safety employees per establishment") + theme(
plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
)
grid.arrange(histogram_employees_per_establishment, qqplot_employees_per_establishment, ncol = 2, top=textGrob("Distribution of Number of FTE employees \nper establishment", gp=gpar(fontsize=12, fontface= 2)))
#Kolmogorov-Smirnov test for checking normality
ks.test(dataset$Number.of.employees.per.establishment, "pnorm", mean= mean(dataset$Number.of.employees.per.establishment), sd = sd(dataset$Number.of.employees.per.establishment))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: dataset$Number.of.employees.per.establishment
## D = 0.10284, p-value = 0.004336
## alternative hypothesis: two-sided
#Since the p value < 0.05, we reject the null hypothesis and conclude that the distribution this sample came from is not normal.
#Running spearman correlation test
cor.test(dataset$proportion.of.successful.responses, dataset$Number.of.employees.per.establishment, method= "spearman")
##
## Spearman's rank correlation rho
##
## data: dataset$proportion.of.successful.responses and dataset$Number.of.employees.per.establishment
## S = 3131228, p-value = 7.909e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2296695
#Checking the type of best transformation for this variable
list.of.transformation.info$Number.of.employees.per.establishment
## [1] " sqrt"
#Since this variable was transformed with square root (order-preserving) and proportion of successful responses was transformed with reflected log (order-reversing), a positive correlation between the transformed variables indicates a negative correlation in their original form, and vice versa.
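To make this sign reversal concrete, here is a minimal sketch on made-up numbers (not the report's data); `reflected.log` is an illustrative helper that assumes the reflection adds 1 before taking logs so that log(0) is avoided:

```r
# Illustrative only: reflecting one variable of a positively related pair
# reverses the sign of the correlation.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 7, 9)                             # y increases with x
reflected.log <- function(v) log(max(v) + 1 - v)  # mirror image, then natural log
cor(x, y)                  # positive
cor(x, reflected.log(y))   # negative
```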
#Visualizing scatterplot between both of these variables, with a 95% CI best fit line
ggplot(dataset, aes(x= Number.of.employees.per.establishment, y=proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees per establishment", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees per est.", caption = "Figure 17. Scatter Plot between Proportion of Successful Responses and Number of FTE employees per est.", color = "legend") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25))
# Errors are heteroskedastic, and the distribution of the errors does not seem normal. Hence, linear regression results will not be reliable.
#In such cases, transforming the variables towards normality and subsequent outlier treatment mitigates the problem. We have already transformed the variables and removed outliers above.
#Visualizing the scatter plot between these transformed variables
ggplot(transformed.dataset, aes(x= Number.of.employees.per.establishment , y= proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees per establishment", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees per establishment after transformation", caption = "Figure 18. Scatter Plot between Proportion of Successful Responses and Number of FTE employees after transformation") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25))
#Visualizing Number of employees after transformation
histogram_transformed_employees_per_est <- ggplot(transformed.dataset, aes(Number.of.employees.per.establishment))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees per establishment',y='Density', title = 'Distribution', caption = "Figure 19(a). Histogram for number of FTE food safety employees \n per establishment (after transformation)") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=0.06, y=60, label= sprintf("Skewness: %s", round(skewness(transformed.dataset$Number.of.employees.per.establishment), 2)), size= 3)
qqplot_transformed_employees_per_est<- ggplot(transformed.dataset, aes(sample= Number.of.employees.per.establishment)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 19(b). Q-Q Plot for number of FTE food safety employees \n per establishment (after transformation)") + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
)
grid.arrange(histogram_transformed_employees_per_est, qqplot_transformed_employees_per_est, ncol = 2, top=textGrob("Distribution of Number of FTE employees per establishment \nafter transformation", gp=gpar(fontsize=12, fontface= 2)))
#testing normality
ks.test(transformed.dataset$Number.of.employees.per.establishment, "pnorm", mean= mean(transformed.dataset$Number.of.employees.per.establishment), sd = sd(transformed.dataset$Number.of.employees.per.establishment))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: transformed.dataset$Number.of.employees.per.establishment
## D = 0.057347, p-value = 0.296
## alternative hypothesis: two-sided
#Since these are continuous and roughly normally distributed variables, we run a Pearson correlation
cor.test(y = transformed.dataset$proportion.of.successful.responses, x = transformed.dataset$Number.of.employees.per.establishment, method= "pearson")
##
## Pearson's product-moment correlation
##
## data: transformed.dataset$Number.of.employees.per.establishment and transformed.dataset$proportion.of.successful.responses
## t = -3.5192, df = 288, p-value = 0.0005029
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.31095707 -0.08997457
## sample estimates:
## cor
## -0.2030499
#Creating simple linear regression model between both of the variables
simple.linear.model.1 <- lm(proportion.of.successful.responses ~ Number.of.employees.per.establishment, data= transformed.dataset)
summary(simple.linear.model.1)
##
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.employees.per.establishment,
## data = transformed.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75608 -0.59873 -0.08871 0.61966 2.22016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4809 0.3557 9.786 < 2e-16 ***
## Number.of.employees.per.establishment -24.1393 6.8594 -3.519 0.000503 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8311 on 288 degrees of freedom
## Multiple R-squared: 0.04123, Adjusted R-squared: 0.0379
## F-statistic: 12.38 on 1 and 288 DF, p-value: 0.0005029
#Estimation approach: coefficients with 95% confidence intervals
cbind(coefficient = coef(simple.linear.model.1), confint(simple.linear.model.1))
## coefficient 2.5 % 97.5 %
## (Intercept) 3.480877 2.780766 4.180988
## Number.of.employees.per.establishment -24.139299 -37.640114 -10.638483
#Creating a baseline model
baseline.model <- lm(proportion.of.successful.responses ~ 1, data = transformed.dataset)
summary(baseline.model)
##
## Call:
## lm(formula = proportion.of.successful.responses ~ 1, data = transformed.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54778 -0.70677 -0.01729 0.69574 1.91070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.24092 0.04975 45.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8473 on 289 degrees of freedom
#Comparing the model with the baseline model
anova(baseline.model, simple.linear.model.1)
## Analysis of Variance Table
##
## Model 1: proportion.of.successful.responses ~ 1
## Model 2: proportion.of.successful.responses ~ Number.of.employees.per.establishment
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 289 207.47
## 2 288 198.91 1 8.5538 12.385 0.0005029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Linear Regression Diagnostic Plots:
# Residuals vs Fitted Values Plot
residuals <- residuals(simple.linear.model.1)
fitted_values <- fitted(simple.linear.model.1)
data <- data.frame(residuals, fitted_values)
residuals.vs.fit.plot <- ggplot(data, aes(x = fitted_values, y = residuals)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(y = "Residuals", x = "Fitted Values", title = "Residuals vs Fitted Values Plot", caption = "Figure 20(a). Residual vs Fitted Values Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5))
#QQ plot for residuals
qq.plot.residuals <- ggplot(data, aes(sample=residuals)) +
stat_qq() +
stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 20(b). Q-Q Plot for errors") + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
)
#Scale-Location Plot: square root of absolute standardized residuals against fitted values
sqrt_abs_std_resid <- sqrt(abs(rstandard(simple.linear.model.1)))
Scale.location.plot <- ggplot(data.frame(fitted_values, sqrt_abs_std_resid), aes(x = fitted_values, y = sqrt_abs_std_resid)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(x= "Fitted Values",y="sqrt(|Standardized Residuals|)", title = "Scale-Location Plot", caption = "Figure 20(c). Scale-Location Plot") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
#Cook's Distance Plot
cooks_distance <- cooks.distance(simple.linear.model.1)
data.cooks <- data.frame(observation = 1:length(cooks_distance), cooks_distance)
Cooks.distance.plot <- ggplot(data.cooks, aes(x = observation, y = cooks_distance, color = cooks_distance)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") + labs(x= "Observation",y="Cook's Distance", title = "Cook's Distance Plot", caption = "Figure 20(d). Cook's Distance Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
#Residual vs Leverage Plot
leverage <- hatvalues(simple.linear.model.1)
data.rl <- data.frame(residuals, leverage, observation = 1:length(leverage))
residual.vs.leverage.plot <- ggplot(data.rl, aes(x = leverage, y = residuals, color = residuals)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") + labs(x= "Leverage",y="Residuals", title = "Residual vs Leverage Plot", caption = "Figure 20(e). Residual vs Leverage Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
# Cook's Distance vs Leverage Plot
cooks_distance <- cooks.distance(simple.linear.model.1)
data.cdl <- data.frame(cooks_distance, leverage, observation = 1:length(leverage))
Cooks.dist.vs.leverage.plot <- ggplot(data.cdl, aes(x = leverage, y = cooks_distance, color = cooks_distance)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") + labs(x= "Leverage",y="Cook's Distance", title = "Cook's Distance vs Leverage Plot", caption = "Figure 20(f). Cook's Distance vs Leverage Plot") +
theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.5)
)
ggarrange(residuals.vs.fit.plot, qq.plot.residuals, Scale.location.plot, Cooks.distance.plot, residual.vs.leverage.plot, Cooks.dist.vs.leverage.plot, ncol = 2)
# Statistical tests to check if assumptions are not violated
#Shapiro-Wilk Normality Test
sresid <- studres(simple.linear.model.1)
shapiro.test(sresid)
##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.98466, p-value = 0.003423
#studentized Breusch-Pagan test to check homoskedasticity of residuals
lmtest::bptest(simple.linear.model.1)
##
## studentized Breusch-Pagan test
##
## data: simple.linear.model.1
## BP = 1.3144, df = 1, p-value = 0.2516
#Durbin-Watson test to check if residuals are autocorrelated
durbinWatsonTest(simple.linear.model.1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.00650269 1.984469 0.92
## Alternative hypothesis: rho != 0
#Applying the UDF on the model to make the best multiple linear regression model!
mainvariable <- "Number.of.employees.per.establishment"
vars <- colnames(subset(transformed.dataset, select = c(Country, LAType, Total.establishments, Establishments.not.yet.rated, Establishments.outside.the.programme, perc.of.broadly.comp.rated.est, perc.of.broadly.comp.rated.and.unrated.est, A.rated.establishments, perc.of.broadly.comp.A.rated.est, B.rated.establishments, perc.of.broadly.comp.B.rated.est, C.rated.establishments, perc.of.broadly.comp.C.rated.est, D.rated.establishments, perc.of.broadly.comp.D.rated.est, E.rated.establishments, perc.of.broadly.comp.E.rated.est)))
interaction.terms <- c("total.no.of.establishments.given.interventions", "perc.of.broadly.comp.rated.and.unrated.est", "Country", "LAType")
multiple.linear.model.1 <- forward.selection(model1 = simple.linear.model.1, variableunderstudy = mainvariable, vector.of.variables.to.try = vars, interaction = TRUE, vector.of.variables.for.interaction = interaction.terms)
summary(multiple.linear.model.1)
##
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.employees.per.establishment +
## Establishments.not.yet.rated + perc.of.broadly.comp.rated.and.unrated.est +
## E.rated.establishments, data = transformed.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54778 -0.61265 -0.05003 0.58788 2.03381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.41779 1.04895 -0.398 0.6907
## Number.of.employees.per.establishment -11.15271 6.91707 -1.612 0.1080
## Establishments.not.yet.rated 0.12848 0.08084 1.589 0.1131
## perc.of.broadly.comp.rated.and.unrated.est 0.42800 0.14709 2.910 0.0039 **
## E.rated.establishments 0.28490 0.14392 1.980 0.0487 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7834 on 285 degrees of freedom
## Multiple R-squared: 0.157, Adjusted R-squared: 0.1452
## F-statistic: 13.27 on 4 and 285 DF, p-value: 6.259e-10
This report presents the results of the following analyses requested by the board:
Visualize the distributions across local authorities of the % of enforcement actions successfully achieved.
Examine whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to enforcement actions by:
Determining whether there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority.
Determining whether there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority
Before conducting further analysis, variables were appropriately transformed. Table 1 shows the transformation applied to each variable. Note that proportion of successful responses was transformed using the “reflected.log” technique (see Table 1), which means that an increase in the transformed value indicates a decrease in its original value. Number of employees was transformed using “log”, i.e., the original values were replaced by their natural logs, so an increase in the transformed value indicates an increase in its original value. Hence, a positive relationship between these transformed variables indicates a negative relationship between their original values, and vice versa.
| Variables | Transformation applied |
|---|---|
| Total.establishments | log |
| Establishments.not.yet.rated | log |
| Establishments.outside.the.programme | log |
| perc.of.broadly.comp.rated.est | reflected.log |
| perc.of.broadly.comp.rated.and.unrated.est | reflected.log |
| A.rated.establishments | reciprocal |
| perc.of.broadly.comp.A.rated.est | sqrt |
| B.rated.establishments | log |
| perc.of.broadly.comp.B.rated.est | reflected.sqrt |
| C.rated.establishments | log |
| perc.of.broadly.comp.C.rated.est | reflected.sqrt |
| D.rated.establishments | log |
| perc.of.broadly.comp.D.rated.est | reflected.reciprocal |
| E.rated.establishments | log |
| perc.of.broadly.comp.E.rated.est | reflected.reciprocal |
| perc.interventions.achieved.by.rated.est | reflected.log |
| perc.interventions.achieved.by.A.rated.est | reflected.reciprocal |
| perc.interventions.achieved.by.B.rated.est | reflected.log |
| perc.interventions.achieved.by.C.rated.est | reflected.log |
| perc.interventions.achieved.by.D.rated.est | reflected.log |
| perc.interventions.achieved.by.E.rated.est | reflected.log |
| perc.interventions.achieved.by.unrated.est | reflected.log |
| est.subject.to.enforcement.Voluntary.closure | log |
| est.subject.to.enforcement.Seizure.detention.and.surrender.of.food | reciprocal |
| est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence | reciprocal |
| est.subject.to.enforcement.Hygiene.prohibition.notice | reciprocal |
| est.subject.to.enforcement.Prohibition.order | reciprocal |
| est.subject.to.enforcement.Simple.caution | reciprocal |
| est.subject.to.enforcement.Hygiene.improvement.notices | log |
| est.subject.to.enforcement.Remedial.action.and.detention.notices | reciprocal |
| est.subject.to.written.warnings | log |
| est.subject.to.enforcement.Prosecutions.concluded | reciprocal |
| Number.of.FTE.food.safety.employees | log |
| no.of.rated.establishments.given.interventions | log |
| no.of.unrated.establishments.given.interventions | log |
| proportion.of.successful.responses | reflected.log |
| total.no.of.establishments.given.interventions | log |
| no.of.enforcement.actions.per.type | log |
| Number.of.employees.per.establishment | sqrt |
| Transformation applied | Meaning of the transformation |
|---|---|
| log | Values were replaced by their Natural Log |
| sqrt | Values were replaced by their square root values |
| reciprocal | Values were replaced by their reciprocal values |
| square | Values were replaced by their squared values |
| cube | Values were replaced by their cube values |
| original | All Values were kept original |
| reflected.log | Each value was first subtracted from the maximum value of the variable to obtain the mirror image of the distribution, and then the natural log of the resultant values was taken |
| reflected.sqrt | Each value was first subtracted from the maximum value of the variable to obtain the mirror image of the distribution, and then the square root of the resultant values was taken |
| reflected.reciprocal | Each value was first subtracted from the maximum value of the variable to obtain the mirror image of the distribution, and then the reciprocal of the resultant values was taken |
| reflected.square | Each value was first subtracted from the maximum value of the variable to obtain the mirror image of the distribution, and the resultant values were squared |
| reflected.cube | Each value was first subtracted from the maximum value of the variable to obtain the mirror image of the distribution, and the resultant values were cubed |
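The rules in the table above can be sketched as a single helper. This is an assumed reconstruction (the report's own transformation code is not reproduced here), and the `+ 1` offset in the reflected variants is an assumption made so that the log and reciprocal stay defined at the variable's maximum:

```r
# Hypothetical helper mirroring the transformation table above.
apply.transformation <- function(v, how) {
  reflect <- function(v) max(v) + 1 - v  # mirror image; the +1 offset is assumed
  switch(how,
    original             = v,
    log                  = log(v),
    sqrt                 = sqrt(v),
    reciprocal           = 1 / v,
    square               = v^2,
    cube                 = v^3,
    reflected.log        = log(reflect(v)),
    reflected.sqrt       = sqrt(reflect(v)),
    reflected.reciprocal = 1 / reflect(v),
    reflected.square     = reflect(v)^2,
    reflected.cube       = reflect(v)^3)
}
apply.transformation(c(1, 4, 9), "sqrt")   # 1 2 3
```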
A Pearson correlation test was conducted, which showed a low correlation coefficient of 0.05, indicating a weak positive relationship between these variables. However, this relationship is not statistically significant (p > 0.05). Hence, we fail to reject the null hypothesis that the true correlation between the two transformed variables is 0.
For further investigation, the simple linear regression model between both of these transformed variables was constructed. According to the results, it was noted that there is an increase of 0.51 units in the transformed value of proportion of successful responses per unit increase in the transformed value of number of employees. However, this increase is not significantly different from 0, t(288) = 1.015, p > 0.05. Hence we fail to reject the null hypothesis that without taking control variables into consideration, there is no relationship between both the variables.
The scatter plot (Figure 11) explains the regression result using Estimation approach. The best fit line indicates a positive relationship between proportion of successful responses and the number of FTE employees after being transformed and treated for outliers. However, the 95% CI area for the best fit line includes the line with slope=0, indicating that it is likely that there is no true linear relationship between the two variables when using the population data. Therefore, the null hypothesis that there is no relationship between the two variables cannot be rejected.
To further evaluate the relationship between Number of FTE employees and proportion of successful responses, a forward feature-selection algorithm was created to build the best multiple linear regression model for answering the research question. The model will best help in understanding the relationship between Number of FTE employees and proportion of successful responses when controlling for other significant variables and also understand any interactive effects between Number of FTE employees and any of the candidates selected to make an interaction term with.
Variables such as Total Number of establishments that were given interventions, Percentage of broadly compliant establishments, Number of Enforcement Actions, Country and LAType were selected as potential candidates with which Number of FTE employees may showcase interaction effects with. These specific variables were selected because they may have an impact on the relationship due to factors such as resources and capacity to implement interventions, level of compliance and non-compliance, and resources available to the local authority.
Moreover, the possible candidates for control variables were only limited to those variables which can be considered as leading indicators for the value of proportion of successful responses.
The best multiple linear regression model included no interaction term and revealed a decrease of 0.33 units in the transformed value of proportion of successful responses per unit increase in the transformed value of number of employees, when controlling for other variables. This decrease is not significantly different from 0, t(285) = -1.957, p > 0.05, leading to the conclusion that there is no relationship between the two variables.
However, there might exist a relationship between number of employees allocated per establishment and proportion of successful responses. Number of employees allocated per establishment was calculated using the following formula:
\[{\text{Number of Employees per establishment}} = \frac{\text{Number of employees}}{\text{Total Number of Establishments}}\]
As per Table 1, Number of employees per establishment was transformed using “sqrt” method. Since the increase in its transformed value will indicate an increase in its original value, it is important to note for further analysis that a positive relationship between the transformed values of Number of employees per establishment and proportion of successful responses will indicate a negative relationship in their original values.
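As a quick sketch of the formula above on a toy data frame (the column names follow the report's dataset; the values are illustrative):

```r
# Illustrative values only; column names match the report's dataset.
toy <- data.frame(Number.of.FTE.food.safety.employees = c(10, 4),
                  Total.establishments                = c(2000, 500))
toy$Number.of.employees.per.establishment <-
  toy$Number.of.FTE.food.safety.employees / toy$Total.establishments
toy$Number.of.employees.per.establishment   # 0.005 0.008
```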
A Pearson correlation test between these transformed variables indicated a negative correlation coefficient of -0.2, which was significantly different from 0 (p < 0.05), thus rejecting the null hypothesis that the true correlation is 0.
A simple linear regression model between these variables found a decrease of 24.14 in the transformed value of proportion of successful responses per unit increase in the transformed value of number of employees per establishment. This decrease is significantly different from 0, t(288) = -3.519, p < 0.001.
Additionally, the scatter plot (Figure 18) illustrates the regression result using the estimation approach. The best fit line indicates a negative relationship between proportion of successful responses and the number of FTE employees per establishment after transformation and outlier treatment. Moreover, the 95% CI band for the best fit line does not include a line with slope = 0, indicating that a true linear relationship between the two variables is likely to hold in the population.
To further evaluate the relationship, the same forward-feature-selection algorithm was used with the same candidates for control variables and interaction terms.
In this case, the best multiple regression model revealed that there was no significant relationship between the transformed values of number of employees per establishment and proportion of successful responses after controlling for other variables, \(t(285) = -1.612, p > 0.05\). The control variables captured the effect that the number of employees has on the proportion of successful responses, making the effect of the number of employees redundant. Hence, increasing the number of employees will not necessarily increase the likelihood of successful responses to enforcement actions. The relationship between the two variables was statistically insignificant after including control variables in the linear regression model.
Hence, we do not have sufficient evidence to suggest that increasing number of employees or increasing number of employees per establishment will increase the likelihood of establishments successfully responding to enforcement actions.
This report is designed to meet the requests of managers of a publishing company, undertaking the specific analyses requested. This report aims to:
Examine whether books from different genres have different daily sales.
Examine whether the sales of the books depend on their average review scores and total number of reviews.
Examine the effect of sales price of the book on the number of sales for the book and if it differs for books belonging to different genres.
This dataset is provided by a Publishing Company. The variables in the dataset as described below.
| Variable | Description |
|---|---|
| sold by | Name of the publisher selling the books |
| publisher.type | Type of publisher selling the book |
| genre | Genre or Category of the book |
| avg.review | Average review of the book by the readers (out of 5) |
| daily.sales | Average number of sales (minus refunds) across all days in the period |
| total.reviews | Total number of reviews given for the book by the readers |
| sale.price | Average price for which the book is sold across all sales in the period |
#Loading the required original dataset for the project and creating a dataframe.
publishing.data <- read_csv("publisher_sales.csv")
#Rename the columns in the dataset that contain special characters to improve the readability of the dataset.
names(publishing.data)[names(publishing.data) == 'sold by'] <- 'sold.by'
#Checking the summary of the data
summary(publishing.data)
## sold.by publisher.type genre avg.review daily.sales
## Length:6000 Length:6000 Length:6000 Min. :0.000 Min. : -0.53
## Class :character Class :character Class :character 1st Qu.:4.100 1st Qu.: 56.77
## Mode :character Mode :character Mode :character Median :4.400 Median : 74.29
## Mean :4.267 Mean : 79.11
## 3rd Qu.:4.620 3rd Qu.: 98.02
## Max. :4.980 Max. :207.98
## total.reviews sale.price
## Min. : 0.0 Min. : 0.740
## 1st Qu.:105.0 1st Qu.: 7.140
## Median :128.0 Median : 8.630
## Mean :132.6 Mean : 8.641
## 3rd Qu.:163.0 3rd Qu.:10.160
## Max. :248.0 Max. :17.460
#From the summary it can be seen that sold.by, publisher.type and genre are character variables and the others are numeric.
#As publisher.type, genre and sold.by are categorical variables, it is better to convert these columns from character to factor format.
publishing.data$publisher.type <- as.factor(publishing.data$publisher.type)
levels(publishing.data$publisher.type)
## [1] "amazon" "big five" "indie" "single author" "small/medium"
#Using levels() it can be seen that the publisher.type column has 5 levels.
publishing.data$sold.by <- as.factor(publishing.data$sold.by)
levels(publishing.data$sold.by)
## [1] "Amazon Digital Services, Inc." "Cengage Learning"
## [3] "DC Comics" "Hachette Book Group"
## [5] "HarperCollins Christian Publishing" "HarperCollins Publishers"
## [7] "HarperCollins Publishing" "Idea & Design Works"
## [9] "Macmillan" "Penguin Group (USA) LLC"
## [11] "Random House LLC" "RCS MediaGroup S.p.A."
## [13] "Simon and Schuster Digital Sales Inc"
#Using levels() it can be seen that there are 13 different sellers of books in the dataset.
publishing.data$genre <- as.factor(publishing.data$genre)
levels(publishing.data$genre)
## [1] "childrens" "fiction" "non_fiction"
#Using levels() it can be seen that the books in the dataset belong to 3 different genres.
#Checking for NA values in specific columns
summarise_all(publishing.data, ~ sum(is.na(.x)))
## # A tibble: 1 × 7
## sold.by publisher.type genre avg.review daily.sales total.reviews sale.price
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
#Since there are no NA values, no further cleaning is needed for this dataset.
#Null Hypothesis of the model: books from different genres have the same daily sales on average
#Alternative Hypothesis of the model: books from different genres have different daily sales on average
#Plotting daily sales by genre to check the distribution of daily sales for each genre.
genre.sales <- ggplot(publishing.data, aes(x=daily.sales, fill=genre)) + geom_histogram(position="identity", alpha=0.5, binwidth=10) + labs(x= 'Daily Sales',y='Frequency', title = 'Distribution of number of sales for different genres', caption = "Figure 1. Distribution of number of sales for different genres") + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0.25)
) + scale_fill_discrete(labels=c('Children', 'Fiction', 'Non-Fiction'))
genre.sales
#Visualizing Box Plot for daily sales
ggplot(publishing.data, aes(y = daily.sales)) +
geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
labs(title = "Box Plot of daily sales", y = "", caption= "Figure 2. Box Plot of daily sales")+
theme_classic() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25))
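The `outlier.treatment()` helper used below is not defined in this excerpt. A plausible sketch, assuming it performs Tukey-fence winsorising (capping values lying more than 1.5 × IQR below Q1 or above Q3), would be:

```r
# Hypothetical sketch only: the actual outlier.treatment() is defined elsewhere
# in the full script. This version caps (winsorises) values beyond Tukey's
# fences rather than removing them.
outlier.treatment <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])                # 1.5 * IQR
  pmin(pmax(x, q[1] - fence), q[2] + fence)   # cap at lower/upper fence
}
```

Capping rather than deleting outliers keeps the sample size at 6000, which is consistent with the degrees of freedom reported in the model summaries below.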
#treating outliers:
publishing.data$daily.sales <- outlier.treatment(publishing.data$daily.sales)
#Formulating the regression model to calculate the effect of genre on daily sales
m.sales.by.genre <- lm(daily.sales ~ genre, data = publishing.data)
#Using summary to find the coefficient values and p-values gained from running the model.
summary(m.sales.by.genre)
##
## Call:
## lm(formula = daily.sales ~ genre, data = publishing.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.219 -13.623 0.154 13.334 60.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.8970 0.4398 127.09 <2e-16 ***
## genrefiction 40.7922 0.6220 65.58 <2e-16 ***
## genrenon_fiction 19.9589 0.6220 32.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.67 on 5997 degrees of freedom
## Multiple R-squared: 0.4177, Adjusted R-squared: 0.4175
## F-statistic: 2151 on 2 and 5997 DF, p-value: < 2.2e-16
#Calculating the estimated marginal means for the model to get means for each genre of the model.
( m.sales.by.genre.emm <- emmeans(m.sales.by.genre, ~genre) )
## genre emmean SE df lower.CL upper.CL
## childrens 55.9 0.44 5997 55.0 56.8
## fiction 96.7 0.44 5997 95.8 97.6
## non_fiction 75.9 0.44 5997 75.0 76.7
##
## Confidence level used: 0.95
#Using pairs() and confint() on the estimated marginal means to find the 95% confidence intervals for the pairwise contrasts
( m.sales.by.genre.pairs <- confint(pairs(m.sales.by.genre.emm)) )
## contrast estimate SE df lower.CL upper.CL
## childrens - fiction -40.8 0.622 5997 -42.3 -39.3
## childrens - non_fiction -20.0 0.622 5997 -21.4 -18.5
## fiction - non_fiction 20.8 0.622 5997 19.4 22.3
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
#Conducting Anova to test whether the variable 'genre' has a significant effect on 'daily sales'
anova(m.sales.by.genre)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 2 1664260 832130 2150.7 < 2.2e-16 ***
## Residuals 5997 2320322 387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Plotting estimates for each genre
p.genre <- ggplot(summary(m.sales.by.genre.emm), aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Genre", y="Daily Sales", subtitle="Error Bars are Extent of 95% CIs", caption = "Figure 3(a). Estimated Means of daily sales for \ndifferent genres") +
scale_x_discrete(guide = guide_axis(n.dodge=2)) + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 9, face = "bold", hjust=0.5),
plot.caption = element_text(face = "italic", hjust = 0.25)
)
#Plotting the differences in estimates based on genre
p.contrasts <- ggplot(m.sales.by.genre.pairs, aes(x=contrast, y=estimate, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Contrast", y="Difference in Daily Sales", subtitle="Error Bars are Extent of 95% CIs", caption = "Figure 3(b). Estimated differences in daily sales \nmeans between genres") +
scale_x_discrete(guide = guide_axis(n.dodge=2)) + theme_classic() + theme(
plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 9, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25)
)
#Using grid.arrange to display both plots simultaneously
grid.arrange(p.genre, p.contrasts, ncol=2, top=textGrob("Differences in daily sales number estimates based on genre", gp=gpar(fontsize=12, fontface= 2)))
#Null Hypothesis of the model: average review scores and total number of reviews have no effect on daily sales
#Alternative Hypothesis of the model: daily sales depend on the average review scores and/or the total number of reviews
#Outlier inspection
#average review scores
ggplot(publishing.data, aes(y = avg.review)) +
geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
labs(title = "Box Plot of average reviews", y = "", caption= "Figure 4. Box Plot for average reviews")+
theme_classic() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25))
#outlier treatment
publishing.data$avg.review <- outlier.treatment(publishing.data$avg.review)
#total number of reviews
ggplot(publishing.data, aes(y = total.reviews)) +
geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
labs(title = "Box Plot of total number of reviews", y = "", caption= "Figure 5. Box Plot for total number of reviews")+
theme_classic() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25))
publishing.data$total.reviews <- outlier.treatment(publishing.data$total.reviews)
#Formulating a multiple regression model to calculate the effect of average reviews and total number of reviews on daily sales
m.sales.by.review <- lm(daily.sales ~ avg.review + total.reviews, data = publishing.data)
#Using summary to find the coefficient values and p-values gained from running the model.
summary(m.sales.by.review)
##
## Call:
## lm(formula = daily.sales ~ avg.review + total.reviews, data = publishing.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.860 -14.209 -0.608 13.715 69.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.953805 3.974847 3.762 0.00017 ***
## avg.review 0.391633 0.875518 0.447 0.65466
## total.reviews 0.446859 0.007189 62.157 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.1 on 5997 degrees of freedom
## Multiple R-squared: 0.3919, Adjusted R-squared: 0.3917
## F-statistic: 1932 on 2 and 5997 DF, p-value: < 2.2e-16
#Conducting an ANOVA to test whether the variables 'total.reviews' and 'avg.review' have a significant effect on daily sales
anova(m.sales.by.review)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## avg.review 1 488 488 1.2072 0.2719
## total.reviews 1 1561020 1561020 3863.4562 <2e-16 ***
## Residuals 5997 2423074 404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Using the `coef()` and `confint()` functions to get the confidence intervals for the estimation approach.
cbind(coefficient=coef(m.sales.by.review), confint(m.sales.by.review))
## coefficient 2.5 % 97.5 %
## (Intercept) 14.9538048 7.1616760 22.7459335
## avg.review 0.3916333 -1.3246964 2.1079630
## total.reviews 0.4468588 0.4327654 0.4609523
#To check for interaction effect a regression model with an interaction term can be formulated.
#Formulating a multiple regression model to calculate the effect of average reviews and total number of reviews on daily sales, including their interaction.
m.sales.by.review.interaction <- lm(daily.sales ~ avg.review*total.reviews, data = publishing.data)
#Using summary to find the coefficient values and p-values gained from running the model.
summary(m.sales.by.review.interaction)
##
## Call:
## lm(formula = daily.sales ~ avg.review * total.reviews, data = publishing.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.855 -14.206 -0.576 13.749 69.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.21342 14.84263 1.497 0.134551
## avg.review -1.25373 3.35731 -0.373 0.708840
## total.reviews 0.39240 0.10751 3.650 0.000264 ***
## avg.review:total.reviews 0.01234 0.02431 0.508 0.611717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.1 on 5996 degrees of freedom
## Multiple R-squared: 0.3919, Adjusted R-squared: 0.3916
## F-statistic: 1288 on 3 and 5996 DF, p-value: < 2.2e-16
#Conducting an ANOVA to test whether 'total.reviews' and 'avg.review' have a significant effect on daily sales when their interaction is included
anova(m.sales.by.review.interaction)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## avg.review 1 488 488 1.2070 0.2720
## total.reviews 1 1561020 1561020 3862.9780 <2e-16 ***
## avg.review:total.reviews 1 104 104 0.2577 0.6117
## Residuals 5996 2422970 404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Using the `coef()` and `confint()` functions to get the confidence intervals for the estimation approach.
cbind(coefficient=coef(m.sales.by.review.interaction), confint(m.sales.by.review.interaction))
## coefficient 2.5 % 97.5 %
## (Intercept) 22.21342362 -6.8834794 51.31032660
## avg.review -1.25372619 -7.8352639 5.32781148
## total.reviews 0.39240480 0.1816513 0.60315826
## avg.review:total.reviews 0.01233938 -0.0353108 0.05998957
#Checking for multicollinearity in the model
vif(m.sales.by.review)
## avg.review total.reviews
## 1.00011 1.00011
#determining outliers
ggplot(publishing.data, aes(y = sale.price)) +
geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
labs(title = "Box Plot of sales price", y = "", caption= "Figure 6. Box Plot for sales price")+
theme_classic() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
plot.caption = element_text(face = "italic", hjust = 0.25))
#treating outliers
publishing.data$sale.price <- outlier.treatment(publishing.data$sale.price)
#Creating a simple regression model to calculate the effect of sale price on daily sales
m.sales.by.saleprice <- lm(daily.sales ~ sale.price, data = publishing.data)
summary(m.sales.by.saleprice)
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = publishing.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.495 -17.850 -2.971 15.564 70.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.3353 1.4780 70.59 <2e-16 ***
## sale.price -3.2750 0.1676 -19.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.99 on 5998 degrees of freedom
## Multiple R-squared: 0.05987, Adjusted R-squared: 0.05971
## F-statistic: 381.9 on 1 and 5998 DF, p-value: < 2.2e-16
anova(m.sales.by.saleprice)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## sale.price 1 238546 238546 381.95 < 2.2e-16 ***
## Residuals 5998 3746036 625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.sales.by.saleprice.emm <- emmeans(m.sales.by.saleprice, ~sale.price))
## sale.price emmean SE df lower.CL upper.CL
## 8.61 76.1 0.323 5998 75.5 76.8
##
## Confidence level used: 0.95
#Creating a multiple regression model to calculate the effect of sale price on daily sales for different genres
m.sales.by.saleprice.genre <- lm(daily.sales ~ sale.price + genre, data = publishing.data)
summary(m.sales.by.saleprice.genre)
##
## Call:
## lm(formula = daily.sales ~ sale.price + genre, data = publishing.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.222 -13.572 0.048 13.194 59.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.9984 1.4549 41.927 < 2e-16 ***
## sale.price -0.5272 0.1433 -3.678 0.000237 ***
## genrefiction 39.9723 0.6601 60.551 < 2e-16 ***
## genrenon_fiction 19.0866 0.6651 28.698 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.65 on 5996 degrees of freedom
## Multiple R-squared: 0.419, Adjusted R-squared: 0.4187
## F-statistic: 1441 on 3 and 5996 DF, p-value: < 2.2e-16
anova(m.sales.by.saleprice.genre)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## sale.price 1 238546 238546 617.82 < 2.2e-16 ***
## genre 2 1430938 715469 1853.03 < 2.2e-16 ***
## Residuals 5996 2315098 386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.sales.by.saleprice.genre.emm <- emmeans(m.sales.by.saleprice.genre, ~sale.price+genre))
## sale.price genre emmean SE df lower.CL upper.CL
## 8.61 childrens 56.5 0.465 5996 55.5 57.4
## 8.61 fiction 96.4 0.445 5996 95.6 97.3
## 8.61 non_fiction 75.5 0.447 5996 74.7 76.4
##
## Confidence level used: 0.95
( m.sales.by.saleprice.genre.pairs <- confint(pairs(m.sales.by.saleprice.genre.emm)) )
## contrast estimate SE df
## sale.price8.60705074730622 childrens - sale.price8.60705074730622 fiction -40.0 0.660 5996
## sale.price8.60705074730622 childrens - sale.price8.60705074730622 non_fiction -19.1 0.665 5996
## sale.price8.60705074730622 fiction - sale.price8.60705074730622 non_fiction 20.9 0.622 5996
## lower.CL upper.CL
## -41.5 -38.4
## -20.6 -17.5
## 19.4 22.3
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
#Constructing a linear regression model with an interaction between sale price and genre.
m.interaction <- lm(daily.sales ~ sale.price*genre, data = publishing.data)
summary(m.interaction)
##
## Call:
## lm(formula = daily.sales ~ sale.price * genre, data = publishing.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.089 -13.654 0.154 13.207 57.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.4396 2.4672 28.146 < 2e-16 ***
## sale.price -1.3995 0.2509 -5.578 2.54e-08 ***
## genrefiction 26.6920 3.2363 8.248 < 2e-16 ***
## genrenon_fiction 8.5527 3.1654 2.702 0.00691 **
## sale.price:genrefiction 1.4681 0.3557 4.127 3.72e-05 ***
## sale.price:genrenon_fiction 1.1332 0.3479 3.257 0.00113 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.62 on 5994 degrees of freedom
## Multiple R-squared: 0.4208, Adjusted R-squared: 0.4203
## F-statistic: 871 on 5 and 5994 DF, p-value: < 2.2e-16
anova(m.interaction)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## sale.price 1 238546 238546 619.5580 < 2.2e-16 ***
## genre 2 1430938 715469 1858.2369 < 2.2e-16 ***
## sale.price:genre 2 7255 3627 9.4212 8.22e-05 ***
## Residuals 5994 2307844 385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(m.sales.by.saleprice.genre.emm.i <- emmeans(m.interaction, ~sale.price*genre))
## sale.price genre emmean SE df lower.CL upper.CL
## 8.61 childrens 57.4 0.514 5994 56.4 58.4
## 8.61 fiction 96.7 0.456 5994 95.8 97.6
## 8.61 non_fiction 75.7 0.461 5994 74.8 76.6
##
## Confidence level used: 0.95
anova(m.sales.by.saleprice, m.sales.by.saleprice.genre)
## Analysis of Variance Table
##
## Model 1: daily.sales ~ sale.price
## Model 2: daily.sales ~ sale.price + genre
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5998 3746036
## 2 5996 2315098 2 1430938 1853 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This report is designed to meet the requests of the managers of a publishing company, undertaking the specific analyses requested. This report aims to:
Examine whether books from different genres have different daily sales.
Examine whether the sales of the books depend on their average review scores and total number of reviews.
Examine the effect of a book's sale price on its number of sales, and whether this effect differs for books belonging to different genres.
The graph in Figure 1 displays the distribution of daily sales for each genre. As depicted, the mean of the distribution varies depending on the genre.
An ANOVA was conducted and it was found that daily sales of books differ significantly based on genre, \(F(2, 5997) = 2150.7\), \(p < .0001\). Furthermore, when comparing the mean daily sales for children's books (55.90) to the other genres, fiction was found to have a mean daily sales 40.79 higher, \(t(5997) = 65.58, p < .0001\). Similarly, the mean daily sales for non-fiction was 19.96 higher than for children's books, \(t(5997) = 32.09, p < .0001\). Thus, through null hypothesis significance testing, the hypothesis that the true mean daily sales are the same for all genres was rejected.
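In terms of the dummy-coded coefficients in the model summary above (children's books as the reference level, estimates rounded), the fitted equation is:

\[
\widehat{\text{daily.sales}} = 55.90 + 40.79 \cdot \text{fiction} + 19.96 \cdot \text{non\_fiction}
\]

where fiction and non_fiction are 0/1 indicator variables, so the intercept is the estimated mean for children's books and each coefficient is that genre's difference from it.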
The data analyzed in Figure 3(a) and Figure 3(b) both demonstrate that there is a significant difference in the mean daily sales of books among different genres. The 95% confidence intervals of the population means do not overlap, indicating that it is highly unlikely that the true population means for the daily sales of each genre are the same. Additionally, the figure illustrates that the differences between the means are substantial. These findings lead us to reject the null hypothesis that the average daily sales for all genres are identical, and instead suggest that books from different genres have distinct daily sales on average.
An examination was conducted to determine the influence of the average review score and the total number of reviews on the daily sales of books. The results show that, holding the average review constant, daily sales vary significantly with the total number of reviews, \(t(5997) = 62.16, p < .0001\): on average, daily sales increase by 0.45 e-books for each unit increase in total reviews. The estimated 95% CI for this change does not include 0, indicating that in the population an increase of one review is associated with higher daily sales. However, holding the total number of reviews constant, daily sales do not vary significantly with the average review, \(t(5997) = 0.45, p > .05\). Thus, we fail to reject the null hypothesis that there is no relationship between average review and daily sales when controlling for total reviews. Furthermore, the estimated 95% CI for the change in daily sales per unit increase in average review does include 0, suggesting that in the population a one-unit increase in average review is unlikely to change daily sales when controlling for total reviews. In conclusion, this analysis demonstrates that the total number of reviews matters more than the average review score for the daily sales of books.
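Rounding the estimates from the summary output above, the fitted multiple regression equation is:

\[
\widehat{\text{daily.sales}} = 14.95 + 0.39 \cdot \text{avg.review} + 0.45 \cdot \text{total.reviews}
\]

For example, a book with an average review of 4.3 and 130 total reviews (close to the sample medians) would be predicted to sell roughly \(14.95 + 0.39 \times 4.3 + 0.45 \times 130 \approx 75\) e-books per day.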
An additional examination was carried out to investigate whether the effect of the total number of reviews on daily sales varied with the average review. The analysis revealed that the interaction was not statistically significant, \(t(5996) = 0.51, p > .05\). Therefore, we fail to reject the null hypothesis that the impact of the number of reviews on sales does not depend on the average review of the book. Furthermore, the estimated 95% CI for the change in the effect of the number of reviews on daily sales per unit increase in the average review includes 0. This suggests that, in the population, a one-unit increase in the average review is unlikely to alter the effect of the number of reviews on daily sales.
Initially, an analysis was conducted to investigate the relationship between sale price and daily sales. On average, daily sales drop by 3.28 e-books for every $1 rise in the sale price, \(t(5998) = -19.54, p < .0001\). The estimated 95% CI of the change in daily sales for every $1 increase in the sale price does not include 0, further supporting the negative relationship and indicating that the slope of the best-fit regression line will be negative for the population data as well.
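Rounding the estimates above, the fitted simple regression line is:

\[
\widehat{\text{daily.sales}} = 104.34 - 3.28 \cdot \text{sale.price}
\]

so a book at the mean sale price of about $8.61 would be predicted to sell roughly \(104.34 - 3.28 \times 8.61 \approx 76\) e-books per day, matching the estimated marginal mean reported above.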
Finally, a regression analysis with an interaction term was conducted to determine whether the relationship between sale price and daily sales differs among genres of e-books. The results revealed a significant difference in the effect of sale price on daily sales across genres, \(F(2, 5994) = 9.42, p < .0001\). Specifically, the effect of sale price on daily sales for non-fiction differs from that for children's books by 1.13 e-books per dollar, which is statistically significant, \(t(5994) = 3.26, p < .01\); the corresponding difference between fiction and children's books is 1.47 e-books per dollar, also statistically significant, \(t(5994) = 4.13, p < .001\). These results lead us to reject the null hypothesis that the relationship between sale price and daily sales is the same for all genres. Furthermore, the estimated 95% CIs of these differences do not include 0, indicating that the relationship between sale price and daily sales is likely to differ among genres in the population.
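Combining the coefficients from the interaction model summary above (rounded), the genre-specific slopes of daily sales on sale price are:

\[
\begin{aligned}
\text{childrens:} \quad & -1.40 \\
\text{fiction:} \quad & -1.40 + 1.47 = 0.07 \\
\text{non\_fiction:} \quad & -1.40 + 1.13 = -0.27
\end{aligned}
\]

That is, price sensitivity is concentrated in children's books, while daily sales of fiction and non-fiction are nearly flat with respect to sale price.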